[jira] [Updated] (SPARK-47828) DataFrameWriterV2.overwrite fails with invalid plan

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47828:
---
Labels: pull-request-available  (was: )

> DataFrameWriterV2.overwrite fails with invalid plan
> ---
>
> Key: SPARK-47828
> URL: https://issues.apache.org/jira/browse/SPARK-47828
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47828) DataFrameWriterV2.overwrite fails with invalid plan

2024-04-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-47828:
--
Affects Version/s: 3.4.2

> DataFrameWriterV2.overwrite fails with invalid plan
> ---
>
> Key: SPARK-47828
> URL: https://issues.apache.org/jira/browse/SPARK-47828
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.2, 4.0.0, 3.5.1
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47792) Make the value of MDC support `null`

2024-04-11 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-47792.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45975
[https://github.com/apache/spark/pull/45975]

> Make the value of MDC support `null`
> 
>
> Key: SPARK-47792
> URL: https://issues.apache.org/jira/browse/SPARK-47792
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47784) [State API v2] Merge TimeoutMode and TTLMode into TimeMode

2024-04-11 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47784.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45960
[https://github.com/apache/spark/pull/45960]

> [State API v2] Merge TimeoutMode and TTLMode into TimeMode
> --
>
> Key: SPARK-47784
> URL: https://issues.apache.org/jira/browse/SPARK-47784
> Project: Spark
>  Issue Type: Story
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bhuwan Sahni
>Assignee: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, users need to specify the notion of time 
> (ProcessingTime/EventTime) for timers and TTL separately. This change lets 
> users specify a single parameter instead.
> We do not expect users to mix and match EventTime/ProcessingTime for timers 
> and TTL in a single query, because that makes it hard to reason about the 
> time semantics (when will a timer fire? when will state be evicted? etc.). 
> It is simpler to stick to one notion of time throughout timers and TTL.
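A rough before/after sketch of the unified parameter from the PySpark side (a 
minimal sketch only; method and parameter names here are illustrative 
assumptions, not the exact API):
{code:java}
# Illustrative sketch; names are assumptions, not the exact API.
# Before: the notion of time was specified twice, once for timers and once
# for state TTL:
#   df.groupBy("key").transformWithStateInPandas(
#       proc, outputMode="update", timeoutMode="processingTime",
#       ttlMode="processingTime")
# After: a single TimeMode drives both timers and state TTL:
#   df.groupBy("key").transformWithStateInPandas(
#       proc, outputMode="update", timeMode="processingTime")
{code}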



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47828) DataFrameWriterV2.overwrite fails with invalid plan

2024-04-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-47828:
--
Issue Type: Bug  (was: Improvement)

> DataFrameWriterV2.overwrite fails with invalid plan
> ---
>
> Key: SPARK-47828
> URL: https://issues.apache.org/jira/browse/SPARK-47828
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47828) DataFrameWriterV2.overwrite fails with invalid plan

2024-04-11 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47828:
-

 Summary: DataFrameWriterV2.overwrite fails with invalid plan
 Key: SPARK-47828
 URL: https://issues.apache.org/jira/browse/SPARK-47828
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.1, 4.0.0
Reporter: Ruifeng Zheng
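A minimal sketch of the kind of call that failed over Spark Connect (the table 
name and condition are hypothetical; assumes a Connect session and an existing 
v2 table):
{code:java}
import pyspark.sql.functions as F

df = spark.range(10)
# Over Spark Connect this produced an invalid plan instead of an overwrite:
df.writeTo("catalog.db.tbl").overwrite(F.lit(True))
{code}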






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47594) Connector module: Migrate logInfo with variables to structured logging framework

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47594:
---
Labels: pull-request-available  (was: )

> Connector module: Migrate logInfo with variables to structured logging 
> framework
> 
>
> Key: SPARK-47594
> URL: https://issues.apache.org/jira/browse/SPARK-47594
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44444) Enabled ANSI mode by default

2024-04-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836397#comment-17836397
 ] 

Dongjoon Hyun commented on SPARK-44444:
---

Hi, All. Here is the discussion thread I sent a few minutes ago.
 * [https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz]

> Enabled ANSI mode by default
> 
>
> Key: SPARK-44444
> URL: https://issues.apache.org/jira/browse/SPARK-44444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> To avoid data issues.
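For context, a minimal illustration of the kind of behavior change ANSI mode 
brings (division by zero; this snippet is illustrative, not from the ticket):
{code:java}
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 1/0 AS r").show()   # legacy mode: silently returns NULL
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1/0 AS r").show()   # ANSI mode: raises a DIVIDE_BY_ZERO error
{code}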



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44444) Enabled ANSI mode by default

2024-04-11 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836398#comment-17836398
 ] 

L. C. Hsieh commented on SPARK-44444:
-

Thank you [~dongjoon]

> Enabled ANSI mode by default
> 
>
> Key: SPARK-44444
> URL: https://issues.apache.org/jira/browse/SPARK-44444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> To avoid data issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44444) Enabled ANSI mode by default

2024-04-11 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836389#comment-17836389
 ] 

Dongjoon Hyun commented on SPARK-44444:
---

I made a draft PR and will initiate the discussion thread for this JIRA, 
[~yumwang], [~LuciferYang], [~HF], [~viirya], [~yao].

> Enabled ANSI mode by default
> 
>
> Key: SPARK-44444
> URL: https://issues.apache.org/jira/browse/SPARK-44444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> To avoid data issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47795) Supplement the doc of job schedule for K8S

2024-04-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47795.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45982
[https://github.com/apache/spark/pull/45982]

> Supplement the doc of job schedule for K8S
> --
>
> Key: SPARK-47795
> URL: https://issues.apache.org/jira/browse/SPARK-47795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47813) Replace getArrayDimension with updateExtraColumnMeta

2024-04-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47813.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46006
[https://github.com/apache/spark/pull/46006]

> Replace getArrayDimension with updateExtraColumnMeta 
> -
>
> Key: SPARK-47813
> URL: https://issues.apache.org/jira/browse/SPARK-47813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47827) Missing warnings for deprecated features

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47827:
---
Labels: pull-request-available  (was: )

> Missing warnings for deprecated features
> 
>
> Key: SPARK-47827
> URL: https://issues.apache.org/jira/browse/SPARK-47827
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> There are some APIs that will be removed but are missing deprecation warnings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47827) Missing warnings for deprecated features

2024-04-11 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-47827:
---

 Summary: Missing warnings for deprecated features
 Key: SPARK-47827
 URL: https://issues.apache.org/jira/browse/SPARK-47827
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee


There are some APIs that will be removed but are missing deprecation warnings.
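A minimal sketch of the missing piece (the function name is hypothetical; 
PySpark commonly uses FutureWarning for APIs slated for removal):
{code:java}
import warnings

def soon_to_be_removed_api():
    # Hypothetical example of the warning the affected APIs should emit:
    warnings.warn(
        "soon_to_be_removed_api is deprecated and will be removed "
        "in a future release.",
        FutureWarning,
    )
{code}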



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47174) Client Side Listener - Server side implementation

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47174.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45988
[https://github.com/apache/spark/pull/45988]

> Client Side Listener - Server side implementation
> -
>
> Key: SPARK-47174
> URL: https://issues.apache.org/jira/browse/SPARK-47174
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47174) Client Side Listener - Server side implementation

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47174:


Assignee: Wei Liu

> Client Side Listener - Server side implementation
> -
>
> Key: SPARK-47174
> URL: https://issues.apache.org/jira/browse/SPARK-47174
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Assignee: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47824) Nondeterminism in pyspark.pandas.series.asof

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47824.
--
Fix Version/s: 3.4.3
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 46018
[https://github.com/apache/spark/pull/46018]

> Nondeterminism in pyspark.pandas.series.asof
> 
>
> Key: SPARK-47824
> URL: https://issues.apache.org/jira/browse/SPARK-47824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>Reporter: Mark Jarvin
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.3, 3.5.2, 4.0.0
>
>
> `max_by` in `pyspark.pandas.series.asof` uses a literal string instead of a 
> generated column as its ordering condition, resulting in nondeterminism.
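A minimal sketch of an affected call (data taken from the pandas-on-Spark 
docstring example; before the fix, repeated runs could disagree because the 
ordering column was a constant literal):
{code:java}
import numpy as np
import pyspark.pandas as ps

s = ps.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
s.asof(20)   # expected 2.0; results could vary across runs before the fix
{code}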



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47824) Nondeterminism in pyspark.pandas.series.asof

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47824:


Assignee: Mark Jarvin

> Nondeterminism in pyspark.pandas.series.asof
> 
>
> Key: SPARK-47824
> URL: https://issues.apache.org/jira/browse/SPARK-47824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>Reporter: Mark Jarvin
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
>
> `max_by` in `pyspark.pandas.series.asof` uses a literal string instead of a 
> generated column as its ordering condition, resulting in nondeterminism.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47826) Add VariantVal for PySpark

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47826:


Assignee: Gene Pang

> Add VariantVal for PySpark
> --
>
> Key: SPARK-47826
> URL: https://issues.apache.org/jira/browse/SPARK-47826
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Gene Pang
>Assignee: Gene Pang
>Priority: Major
> Fix For: 4.0.0
>
>
> Add a `VariantVal` implementation for PySpark. It includes convenience 
> methods to convert the Variant to a string, or to a Python object, so that 
> users can more easily work with Variant data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47826) Add VariantVal for PySpark

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47826.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/45826

> Add VariantVal for PySpark
> --
>
> Key: SPARK-47826
> URL: https://issues.apache.org/jira/browse/SPARK-47826
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Gene Pang
>Priority: Major
> Fix For: 4.0.0
>
>
> Add a `VariantVal` implementation for PySpark. It includes convenience 
> methods to convert the Variant to a string, or to a Python object, so that 
> users can more easily work with Variant data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47826) Add VariantVal for PySpark

2024-04-11 Thread Gene Pang (Jira)
Gene Pang created SPARK-47826:
-

 Summary: Add VariantVal for PySpark
 Key: SPARK-47826
 URL: https://issues.apache.org/jira/browse/SPARK-47826
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Gene Pang
 Fix For: 4.0.0


Add a `VariantVal` implementation for PySpark. It includes convenience methods 
to convert the Variant to a string, or to a Python object, so that users can 
more easily work with Variant data.
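A rough usage sketch (method names are assumed from the description above and 
may differ from the final API):
{code:java}
# Assumes a Spark 4.0 build with VARIANT support.
row = spark.sql("""SELECT parse_json('{"a": 1}') AS v""").first()
v = row["v"]          # a VariantVal on the Python side
print(str(v))         # string (JSON) form
obj = v.toPython()    # plain Python object, e.g. {'a': 1}
{code}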

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47811) Run ML tests for pyspark-connect package

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47811.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45941
[https://github.com/apache/spark/pull/45941]

> Run ML tests for pyspark-connect package
> 
>
> Key: SPARK-47811
> URL: https://issues.apache.org/jira/browse/SPARK-47811
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47811) Run ML tests for pyspark-connect package

2024-04-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47811:


Assignee: Hyukjin Kwon

> Run ML tests for pyspark-connect package
> 
>
> Key: SPARK-47811
> URL: https://issues.apache.org/jira/browse/SPARK-47811
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47825) Make `KinesisTestUtils` & `WriteInputFormatTestDataGenerator` deprecated

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47825:
---
Labels: pull-request-available  (was: )

> Make `KinesisTestUtils` & `WriteInputFormatTestDataGenerator` deprecated
> 
>
> Key: SPARK-47825
> URL: https://issues.apache.org/jira/browse/SPARK-47825
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.5.2
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47825) Make `KinesisTestUtils` & `WriteInputFormatTestDataGenerator` deprecated

2024-04-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47825:
---

 Summary: Make `KinesisTestUtils` & 
`WriteInputFormatTestDataGenerator` deprecated
 Key: SPARK-47825
 URL: https://issues.apache.org/jira/browse/SPARK-47825
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 3.5.2
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47824) Nondeterminism in pyspark.pandas.series.asof

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47824:
---
Labels: pull-request-available  (was: )

> Nondeterminism in pyspark.pandas.series.asof
> 
>
> Key: SPARK-47824
> URL: https://issues.apache.org/jira/browse/SPARK-47824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>Reporter: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
>
> `max_by` in `pyspark.pandas.series.asof` uses a literal string instead of a 
> generated column as its ordering condition, resulting in nondeterminism.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47824) Nondeterminism in pyspark.pandas.series.asof

2024-04-11 Thread Mark Jarvin (Jira)
Mark Jarvin created SPARK-47824:
---

 Summary: Nondeterminism in pyspark.pandas.series.asof
 Key: SPARK-47824
 URL: https://issues.apache.org/jira/browse/SPARK-47824
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.4, 3.5.1, 3.4.2, 4.0.0
Reporter: Mark Jarvin


`max_by` in `pyspark.pandas.series.asof` uses a literal string instead of a 
generated column as its ordering condition, resulting in nondeterminism.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47814) Move the `KinesisTestUtils` from `main` to `test`

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47814:
-

Assignee: BingKun Pan

> Move the `KinesisTestUtils` from `main` to `test`
> ---
>
> Key: SPARK-47814
> URL: https://issues.apache.org/jira/browse/SPARK-47814
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47814) Move the `KinesisTestUtils` from `main` to `test`

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47814.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46000
[https://github.com/apache/spark/pull/46000]

> Move the `KinesisTestUtils` from `main` to `test`
> ---
>
> Key: SPARK-47814
> URL: https://issues.apache.org/jira/browse/SPARK-47814
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47823) Improve appName and getOrCreate usage for Spark Connect

2024-04-11 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-47823:
-
Description: 
 

In Spark Connect
{code:java}
spark = SparkSession.builder.appName("...").getOrCreate(){code}
 

raises an error

 
{code:java}
[CANNOT_CONFIGURE_SPARK_CONNECT_MASTER] Spark Connect server and Spark master 
cannot be configured together: Spark master [...], Spark Connect [...]{code}
 

We should ban the usage of appName in Spark Connect.

 

  was:
 

In Spark Connect
{code:java}
spark = SparkSession.builder.appName("...").getOrCreate(){code}
 

raises an error

 
{code:java}
[CANNOT_CONFIGURE_SPARK_CONNECT_MASTER] Spark Connect server and Spark master 
cannot be configured together: Spark master [...], Spark Connect [...]{code}
 

We should ban the usage of appName in Spark Connect.

 


> Improve appName and getOrCreate usage for Spark Connect
> ---
>
> Key: SPARK-47823
> URL: https://issues.apache.org/jira/browse/SPARK-47823
> Project: Spark
>  Issue Type: Story
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> In Spark Connect
> {code:java}
> spark = SparkSession.builder.appName("...").getOrCreate(){code}
>  
> raises an error
>  
> {code:java}
> [CANNOT_CONFIGURE_SPARK_CONNECT_MASTER] Spark Connect server and Spark master 
> cannot be configured together: Spark master [...], Spark Connect [...]{code}
>  
> We should ban the usage of appName in Spark Connect.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47823) Improve appName and getOrCreate usage for Spark Connect

2024-04-11 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-47823:


 Summary: Improve appName and getOrCreate usage for Spark Connect
 Key: SPARK-47823
 URL: https://issues.apache.org/jira/browse/SPARK-47823
 Project: Spark
  Issue Type: Story
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Xinrong Meng


 

In Spark Connect
{code:java}
spark = SparkSession.builder.appName("...").getOrCreate(){code}
 

raises an error

 
{code:java}
[CANNOT_CONFIGURE_SPARK_CONNECT_MASTER] Spark Connect server and Spark master 
cannot be configured together: Spark master [...], Spark Connect [...]{code}
 

We should ban the usage of appName in Spark Connect.
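A sketch of the Connect-side pattern that does work (the endpoint address is 
an example):
{code:java}
from pyspark.sql import SparkSession

# A Connect session is addressed by a remote URL; master-oriented settings
# such as appName conflict with it, hence the error above.
spark = (SparkSession.builder
         .remote("sc://localhost:15002")
         .getOrCreate())
{code}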

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47822) Prohibit Hash expressions from hashing Variant type

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47822:
---
Labels: pull-request-available  (was: )

> Prohibit Hash expressions from hashing Variant type
> ---
>
> Key: SPARK-47822
> URL: https://issues.apache.org/jira/browse/SPARK-47822
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Harsh Motwani
>Priority: Major
>  Labels: pull-request-available
>
> Prohibit hash functions from being applied to the Variant type. This is 
> because they haven't been implemented for the Variant type and they crash 
> during execution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47822) Prohibit Hash expressions from hashing Variant type

2024-04-11 Thread Harsh Motwani (Jira)
Harsh Motwani created SPARK-47822:
-

 Summary: Prohibit Hash expressions from hashing Variant type
 Key: SPARK-47822
 URL: https://issues.apache.org/jira/browse/SPARK-47822
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Harsh Motwani


Prohibit hash functions from being applied to the Variant type. This is 
because they haven't been implemented for the Variant type and they crash 
during execution.
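A minimal sketch of the kind of query affected (assumes a Spark 4.0 build with 
VARIANT support; the exact error class is not specified here):
{code:java}
df = spark.sql("""SELECT parse_json('{"a": 1}') AS v""")
# Previously this crashed at execution time; with this change it should be
# rejected up front instead:
df.selectExpr("hash(v)").collect()
{code}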



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47318) AuthEngine key exchange needs additional KDF round

2024-04-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-47318:

Affects Version/s: 3.5.0
   3.4.0

>  AuthEngine key exchange needs additional KDF round
> ---
>
> Key: SPARK-47318
> URL: https://issues.apache.org/jira/browse/SPARK-47318
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Steve Weis
>Assignee: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AuthEngine implements a bespoke [key exchange 
> protocol|https://github.com/apache/spark/tree/master/common/network-common/src/main/java/org/apache/spark/network/crypto] 
> based on the NNpsk0 Noise pattern and using X25519.
> The Spark code improperly uses the derived shared secret directly, which is 
> an encoded X coordinate. This should be passed into a KDF rather than used 
> directly.
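A conceptual sketch of the fix using the Python cryptography package 
(illustrative only; Spark's actual implementation is in Java):
{code:java}
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

client, server = X25519PrivateKey.generate(), X25519PrivateKey.generate()
raw = client.exchange(server.public_key())  # encoded X coordinate, not uniform
key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
           info=b"derived key").derive(raw)  # the additional KDF round
{code}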



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47318) AuthEngine key exchange needs additional KDF round

2024-04-11 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-47318.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45425
[https://github.com/apache/spark/pull/45425]

>  AuthEngine key exchange needs additional KDF round
> ---
>
> Key: SPARK-47318
> URL: https://issues.apache.org/jira/browse/SPARK-47318
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 4.0.0
>Reporter: Steve Weis
>Assignee: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AuthEngine implements a bespoke [key exchange 
> protocol|https://github.com/apache/spark/tree/master/common/network-common/src/main/java/org/apache/spark/network/crypto] 
> based on the NNpsk0 Noise pattern and using X25519.
> The Spark code improperly uses the derived shared secret directly, which is 
> an encoded X coordinate. This should be passed into a KDF rather than used 
> directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47818) Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47818:
---
Description: 
While building the DataFrame step by step, each time a new DataFrame is 
generated with an empty schema, which is lazily computed on access. However, if 
a user's code frequently accesses the schema of these new DataFrames using 
methods such as `df.columns`, it will result in a large number of Analyze 
requests to the server. Each time, the entire plan needs to be reanalyzed, 
leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation if the resolved logical plan of a subtree can be 
cached.

A minimal example of the problem:
{code:java}
import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze 
request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() {code}
With this patch, the performance of the above code improved from ~110s to ~5s.

  was:
While building the DataFrame step by step, each time a new DataFrame is 
generated with an empty schema, which is lazily computed on access. However, if 
a user's code frequently accesses the schema of these new DataFrames using 
methods such as `df.columns`, it will result in a large number of Analyze 
requests to the server. Each time, the entire plan needs to be reanalyzed, 
leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation if the resolved logical plan of a subtree can be 
cached.

A minimal example of the problem:
{code:java}
import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze 
request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() {code}
With this patch, the performance of the above code improved from ~115s to ~5s.


> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze 
> requests
> --
>
> Key: SPARK-47818
> URL: https://issues.apache.org/jira/browse/SPARK-47818
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> While building the DataFrame step by step, each time a new DataFrame is 
> generated with an empty schema, which is lazily computed on access. However, 
> if a user's code frequently accesses the schema of these new DataFrames using 
> methods such as `df.columns`, it will result in a large number of Analyze 
> requests to the server. Each time, the entire plan needs to be reanalyzed, 
> leading to poor performance, especially when constructing highly complex 
> plans.
> Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
> overhead of repeated analysis during this process. This is achieved by saving 
> significant computation if the resolved logical plan of a subtree can be 
> cached.
> A minimal example of the problem:
> {code:java}
> import pyspark.sql.functions as F
> df = spark.range(10)
> for i in range(200):
>   if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze 
> request in every iteration
>     df = df.withColumn(str(i), F.col("id") + i)
> df.show() {code}
> With this patch, the performance of the above code improved from ~110s to ~5s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44444) Enabled ANSI mode by default

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44444:
---
Labels: pull-request-available  (was: )

> Enabled ANSI mode by default
> 
>
> Key: SPARK-44444
> URL: https://issues.apache.org/jira/browse/SPARK-44444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> To avoid data issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47818) Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47818:
---
Labels: pull-request-available  (was: )

> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze 
> requests
> --
>
> Key: SPARK-47818
> URL: https://issues.apache.org/jira/browse/SPARK-47818
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> While building the DataFrame step by step, each time a new DataFrame is 
> generated with an empty schema, which is lazily computed on access. However, 
> if a user's code frequently accesses the schema of these new DataFrames using 
> methods such as `df.columns`, it will result in a large number of Analyze 
> requests to the server. Each time, the entire plan needs to be reanalyzed, 
> leading to poor performance, especially when constructing highly complex 
> plans.
> Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
> overhead of repeated analysis during this process. This is achieved by saving 
> significant computation if the resolved logical plan of a subtree can be 
> cached.
> A minimal example of the problem:
> {code:java}
> import pyspark.sql.functions as F
> df = spark.range(10)
> for i in range(200):
>   if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze 
> request in every iteration
>     df = df.withColumn(str(i), F.col("id") + i)
> df.show() {code}
> With this patch, the performance of the above code improved from ~115s to ~5s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47818) Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47818:
---
Description: 
While building the DataFrame step by step, each time a new DataFrame is 
generated with an empty schema, which is lazily computed on access. However, if 
a user's code frequently accesses the schema of these new DataFrames using 
methods such as `df.columns`, it will result in a large number of Analyze 
requests to the server. Each time, the entire plan needs to be reanalyzed, 
leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation if the resolved logical plan of a subtree can be 
cached.

A minimal example of the problem:
{code:java}
import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze 
request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() {code}
With this patch, the performance of the above code improved from ~115s to ~5s.

  was:
While building the DataFrame step by step, each time a new DataFrame is 
generated with an empty schema, which is lazily computed on access. However, if 
a user's code frequently accesses the schema of these new DataFrames using 
methods such as `df.columns`, it will result in a large number of Analyze 
requests to the server. Each time, the entire plan needs to be reanalyzed, 
leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation if the resolved logical plan of a subtree can be 
cached.

A minimal example of the problem:
{code:java}
import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze 
request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() {code}
 


> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze 
> requests
> --
>
> Key: SPARK-47818
> URL: https://issues.apache.org/jira/browse/SPARK-47818
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
> Fix For: 4.0.0
>
>
> While building the DataFrame step by step, each time a new DataFrame is 
> generated with an empty schema, which is lazily computed on access. However, 
> if a user's code frequently accesses the schema of these new DataFrames using 
> methods such as `df.columns`, it will result in a large number of Analyze 
> requests to the server. Each time, the entire plan needs to be reanalyzed, 
> leading to poor performance, especially when constructing highly complex 
> plans.
> Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
> overhead of repeated analysis during this process. This is achieved by saving 
> significant computation if the resolved logical plan of a subtree can be 
> cached.
> A minimal example of the problem:
> {code:java}
> import pyspark.sql.functions as F
> df = spark.range(10)
> for i in range(200):
>   if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze 
> request in every iteration
>     df = df.withColumn(str(i), F.col("id") + i)
> df.show() {code}
> With this patch, the performance of the above code improved from ~115s to ~5s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47819) Use asynchronous callback for execution cleanup

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47819:
---
Description: 
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in rare cases, 
interrupting the execution thread of a query in a session can take hours, 
causing the entire maintenance process to stall, resulting in a large amount of 
memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above scenarios. To be more specific, instead of 
calling {{runner.join()}} in ExecutorHolder.close(), we set a post-cleanup 
function as the callback through {{runner.processOnCompletion}}, which will 
be called asynchronously once the execution runner is completed or interrupted. 
In this way, the maintenance thread won't get blocked on {{join}}ing an 
execution thread.

 

  was:
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios. To be more specific, 
instead of calling {{runner.join()}} in ExecutorHolder.close(), we set a 
post-cleanup function as the callback through 
{{runner.processOnCompletion}}, which will be called asynchronously once 
the execution runner is completed or interrupted. In this way, the maintenance 
thread won't get blocked on {{join}}ing an execution thread.

 


> Use asynchronous callback for execution cleanup
> ---
>
> Key: SPARK-47819
> URL: https://issues.apache.org/jira/browse/SPARK-47819
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
> Fix For: 4.0.0
>
>
> Expired sessions are regularly checked and cleaned up by a maintenance 
> thread. However, currently, this process is synchronous. Therefore, in rare 
> cases, interrupting the execution thread of a query in a session can take 
> hours, causing the entire maintenance process to stall, resulting in a large 
> amount of memory not being cleared.
> We address this by introducing asynchronous callbacks for execution cleanup, 
> avoiding synchronous joins of execution threads, and preventing the 
> maintenance thread from stalling in the above scenarios. To be more specific, 
> instead of calling {{runner.join()}} in ExecutorHolder.close(), we set a 
> post-cleanup function as the callback through 
> {{runner.processOnCompletion}}, which will be called asynchronously once 
> the execution runner is completed or interrupted. In this way, the 
> maintenance thread won't get blocked on {{join}}ing an execution thread.
>  
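A minimal Python sketch of the join-vs-callback distinction described above 
(illustrative only; the actual implementation is Scala inside Spark Connect):
{code:java}
import threading, time

def process_on_completion(thread, callback):
    # Watcher pattern: run `callback` asynchronously once `thread` finishes,
    # so the caller (here, the maintenance thread) never blocks on join().
    threading.Thread(target=lambda: (thread.join(), callback()),
                     daemon=True).start()

runner = threading.Thread(target=lambda: time.sleep(2))  # a slow "execution"
runner.start()
done = threading.Event()
process_on_completion(runner, done.set)  # returns immediately
print("maintenance thread is not blocked")
done.wait()  # demo only: wait so the callback visibly fires
{code}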



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47777) Add spark connect test for python streaming data source

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47777:
--
Component/s: Tests

> Add spark connect test for python streaming data source
> ---
>
> Key: SPARK-47777
> URL: https://issues.apache.org/jira/browse/SPARK-47777
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SS, Tests
>Affects Versions: 3.5.1
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make the Python streaming data source PySpark tests also run on Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47777) Add spark connect test for python streaming data source

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47777:
-

Assignee: Chaoqin Li

> Add spark connect test for python streaming data source
> ---
>
> Key: SPARK-47777
> URL: https://issues.apache.org/jira/browse/SPARK-47777
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SS
>Affects Versions: 3.5.1
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Make the Python streaming data source PySpark tests also run on Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44444) Enabled ANSI mode by default

2024-04-11 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836283#comment-17836283
 ] 

L. C. Hsieh commented on SPARK-44444:
-

What impact could this change bring to user applications? Query failures? Or 
different query results?

> Enabled ANSI mode by default
> 
>
> Key: SPARK-44444
> URL: https://issues.apache.org/jira/browse/SPARK-44444
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> To avoid data issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47777) Add spark connect test for python streaming data source

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47777.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45950
[https://github.com/apache/spark/pull/45950]

> Add spark connect test for python streaming data source
> ---
>
> Key: SPARK-47777
> URL: https://issues.apache.org/jira/browse/SPARK-47777
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SS
>Affects Versions: 3.5.1
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make the Python streaming data source PySpark tests also run on Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47820) Run `ANSI` SQL CI twice per day

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47820.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46010
[https://github.com/apache/spark/pull/46010]

> Run `ANSI` SQL CI twice per day
> ---
>
> Key: SPARK-47820
> URL: https://issues.apache.org/jira/browse/SPARK-47820
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47820) Run `ANSI` SQL CI twice per day

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47820:
---
Labels: pull-request-available  (was: )

> Run `ANSI` SQL CI twice per day
> ---
>
> Key: SPARK-47820
> URL: https://issues.apache.org/jira/browse/SPARK-47820
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47820) Run `ANSI` SQL CI twice per day

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47820:
--
Summary: Run `ANSI` SQL CI twice per day  (was: Run `ANSI` SQL Daily CI 
twice)

> Run `ANSI` SQL CI twice per day
> ---
>
> Key: SPARK-47820
> URL: https://issues.apache.org/jira/browse/SPARK-47820
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47820) Run `ANSI` SQL Daily CI twice

2024-04-11 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47820:
-

 Summary: Run `ANSI` SQL Daily CI twice
 Key: SPARK-47820
 URL: https://issues.apache.org/jira/browse/SPARK-47820
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47820) Run `ANSI` SQL Daily CI twice

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47820:
--
Parent: SPARK-44111
Issue Type: Sub-task  (was: Test)

> Run `ANSI` SQL Daily CI twice
> -
>
> Key: SPARK-47820
> URL: https://issues.apache.org/jira/browse/SPARK-47820
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47799) Preserve parameter information when using SBT package jar

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47799:
--
Affects Version/s: 4.0.0
   (was: 3.5.1)

> Preserve parameter information when using SBT package jar
> -
>
> Key: SPARK-47799
> URL: https://issues.apache.org/jira/browse/SPARK-47799
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47799) Preserve parameter information when using SBT package jar

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47799.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45983
[https://github.com/apache/spark/pull/45983]

> Preserve parameter information when using SBT package jar
> -
>
> Key: SPARK-47799
> URL: https://issues.apache.org/jira/browse/SPARK-47799
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47799) Preserve parameter information when using SBT package jar

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47799:
-

Assignee: dzcxzl

> Preserve parameter information when using SBT package jar
> -
>
> Key: SPARK-47799
> URL: https://issues.apache.org/jira/browse/SPARK-47799
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47819) Use asynchronous callback for execution cleanup

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47819:
---
Description: 
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios. To be more specific, 
instead of calling {{runner.join()}} in ExecutorHolder.close(), we set a 
post-cleanup function as the callback through 
{{{}runner.processOnCompletion{}}}, which will be called asynchronously once 
the execution runner is completed or interrupted. In this way, the maintenance 
thread won't get blocked on {{{}join{}}}ing an execution thread.

 

  was:
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios.

 


> Use asynchronous callback for execution cleanup
> ---
>
> Key: SPARK-47819
> URL: https://issues.apache.org/jira/browse/SPARK-47819
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
> Fix For: 4.0.0
>
>
> Expired sessions are regularly checked and cleaned up by a maintenance 
> thread. However, currently, this process is synchronous. Therefore, in 
> occasional cases, interrupting the execution thread of a query in a session 
> can take hours, causing the entire maintenance process to stall, resulting in 
> a large amount of memory not being cleared.
> We address this by introducing asynchronous callbacks for execution cleanup, 
> avoiding synchronous joins of execution threads, and preventing the 
> maintenance thread from stalling in the above occasional scenarios. To be 
> more specific, instead of calling {{runner.join()}} in 
> ExecutorHolder.close(), we set a post-cleanup function as the callback 
> through {{{}runner.processOnCompletion{}}}, which will be called 
> asynchronously once the execution runner is completed or interrupted. In this 
> way, the maintenance thread won't get blocked on {{{}join{}}}ing an execution 
> thread.
>  
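
For illustration, here is a minimal sketch of the callback pattern described above, written against a hypothetical ExecutionRunner stand-in rather than the actual Connect classes:
{code:scala}
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

// Hypothetical stand-in for the runner backing a query execution.
final class ExecutionRunner(work: () => Unit) {
  private val done = Promise[Unit]()
  def start(): Unit = Future(work()).onComplete(done.tryComplete)
  // Register a callback fired once the execution completes or is interrupted.
  def processOnCompletion(callback: Try[Unit] => Unit): Unit =
    done.future.onComplete(callback)
}

// Maintenance-thread side: register asynchronous cleanup instead of calling
// runner.join(), so close() returns immediately and never blocks.
def close(runner: ExecutionRunner): Unit =
  runner.processOnCompletion { _ =>
    () // post-cleanup here: release buffers, drop registry entries, etc.
  }
{code}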



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47818) Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47818:
---
Description: 
While building a DataFrame step by step, each transformation generates a new 
DataFrame with an empty schema, which is lazily computed on access. However, if 
a user's code frequently accesses the schema of these new DataFrames using 
methods such as `df.columns`, it will result in a large number of Analyze 
requests to the server. Each time, the entire plan needs to be reanalyzed, 
leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation whenever the resolved logical plan of a subtree can be 
cached.

A minimal example of the problem:
{code:python}
import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() {code}
 

  was:
While building a DataFrame step by step, each transformation generates a new 
DataFrame with an empty schema, which is lazily computed on access. However, if 
a user's code frequently accesses the schema of these new DataFrames using 
methods such as `df.columns`, it will result in a large number of Analyze 
requests to the server. Each time, the entire plan needs to be reanalyzed, 
leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation whenever the resolved logical plan of a subtree can be 
cached.


> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze 
> requests
> --
>
> Key: SPARK-47818
> URL: https://issues.apache.org/jira/browse/SPARK-47818
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
> Fix For: 4.0.0
>
>
> While building a DataFrame step by step, each transformation generates a new 
> DataFrame with an empty schema, which is lazily computed on access. However, 
> if a user's code frequently accesses the schema of these new DataFrames using 
> methods such as `df.columns`, it will result in a large number of Analyze 
> requests to the server. Each time, the entire plan needs to be reanalyzed, 
> leading to poor performance, especially when constructing highly complex 
> plans.
> Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
> overhead of repeated analysis during this process. This is achieved by saving 
> significant computation whenever the resolved logical plan of a subtree can be 
> cached.
> A minimal example of the problem:
> {code:python}
> import pyspark.sql.functions as F
> df = spark.range(10)
> for i in range(200):
>   if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze request in every iteration
>     df = df.withColumn(str(i), F.col("id") + i)
> df.show() {code}
>  
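
As a rough sketch of the caching idea (names and structure are assumptions, not the actual SparkConnectPlanner code), a small bounded LRU cache from a plan key to its resolved logical plan could look like:
{code:scala}
import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.Entry

// Bounded LRU cache; K and V are placeholders for the real proto plan and
// resolved LogicalPlan types.
final class PlanCache[K, V](capacity: Int) {
  private val cache = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: Entry[K, V]): Boolean =
      size() > capacity
  }
  def getOrElseUpdate(key: K)(analyze: => V): V = synchronized {
    Option(cache.get(key)) match {
      case Some(v) => v // cache hit: skip re-analysis of this subtree
      case None =>
        val v = analyze // cache miss: analyze once and remember the result
        cache.put(key, v)
        v
    }
  }
}
{code}
With such a cache on the server, repeated schema accesses over the same subtree would hit the cache instead of triggering a full re-analysis each time.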



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47819) Use asynchronous callback for execution cleanup

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47819:
---
Description: 
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios.

 

  was:
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios.

A minimal example of the problem:
{code:python}
import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() {code}
 


> Use asynchronous callback for execution cleanup
> ---
>
> Key: SPARK-47819
> URL: https://issues.apache.org/jira/browse/SPARK-47819
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
> Fix For: 4.0.0
>
>
> Expired sessions are regularly checked and cleaned up by a maintenance 
> thread. However, currently, this process is synchronous. Therefore, in 
> occasional cases, interrupting the execution thread of a query in a session 
> can take hours, causing the entire maintenance process to stall, resulting in 
> a large amount of memory not being cleared.
> We address this by introducing asynchronous callbacks for execution cleanup, 
> avoiding synchronous joins of execution threads, and preventing the 
> maintenance thread from stalling in the above occasional scenarios.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47819) Use asynchronous callback for execution cleanup

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47819:
---
Description: 
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios.

A minimal example of the problem:
{code:python}
import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() {code}
 

  was:
Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios.


> Use asynchronous callback for execution cleanup
> ---
>
> Key: SPARK-47819
> URL: https://issues.apache.org/jira/browse/SPARK-47819
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
> Fix For: 4.0.0
>
>
> Expired sessions are regularly checked and cleaned up by a maintenance 
> thread. However, currently, this process is synchronous. Therefore, in 
> occasional cases, interrupting the execution thread of a query in a session 
> can take hours, causing the entire maintenance process to stall, resulting in 
> a large amount of memory not being cleared.
> We address this by introducing asynchronous callbacks for execution cleanup, 
> avoiding synchronous joins of execution threads, and preventing the 
> maintenance thread from stalling in the above occasional scenarios.
> A minimal example of the problem:
> {code:python}
> import pyspark.sql.functions as F
> df = spark.range(10)
> for i in range(200):
>   if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze request in every iteration
>     df = df.withColumn(str(i), F.col("id") + i)
> df.show() {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47819) Use asynchronous callback for execution cleanup

2024-04-11 Thread Xi Lyu (Jira)
Xi Lyu created SPARK-47819:
--

 Summary: Use asynchronous callback for execution cleanup
 Key: SPARK-47819
 URL: https://issues.apache.org/jira/browse/SPARK-47819
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Xi Lyu
 Fix For: 4.0.0


Expired sessions are regularly checked and cleaned up by a maintenance thread. 
However, currently, this process is synchronous. Therefore, in occasional 
cases, interrupting the execution thread of a query in a session can take 
hours, causing the entire maintenance process to stall, resulting in a large 
amount of memory not being cleared.

We address this by introducing asynchronous callbacks for execution cleanup, 
avoiding synchronous joins of execution threads, and preventing the maintenance 
thread from stalling in the above occasional scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47418) Optimize string predicate expressions for UTF8_BINARY_LCASE collation

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47418:
-
Description: 
Implement the {*}contains{*}, {*}startsWith{*}, and *endsWith* built-in string 
functions in Spark using the optimized lowercase comparison approach introduced by 
[~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to the 
latest design and code structure imposed by [~uros-db] in 
https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation 
support is introduced for Spark SQL expressions. In addition, review previous 
Jira tickets under the current parent in order to understand how 
*StringPredicate* expressions are currently used and tested in Spark:
 * [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
 * [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
 * [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]

These tickets should help you understand what changes were introduced in order 
to enable collation support for these functions. Lastly, feel free to use your 
chosen Spark SQL Editor to play around with the existing functions and learn 
more about how they work.

 

The goal for this Jira ticket is to improve the UTF8_BINARY_LCASE 
implementation for the {*}contains{*}, {*}startsWith{*}, and *endsWith* 
functions so that they use the optimized lowercase comparison approach (following 
the general logic in Nikola's PR), and benchmark the results accordingly. As 
for testing, the currently existing unit test cases and end-to-end tests should 
already fully cover the expected behaviour of *StringPredicate* expressions for 
all collation types. In other words, the objective of this ticket is only to 
enhance the internal implementation, without introducing any user-facing 
changes to Spark SQL API.

 

Finally, feel free to refer to the Unicode Technical Standard for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Implement the {*}contains{*}, {*}startsWith{*}, and *endsWith* built-in string 
> functions in Spark using the optimized lowercase comparison approach introduced by 
> [~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to 
> the latest design and code structure imposed by [~uros-db] in 
> https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation 
> support is introduced for Spark SQL expressions. In addition, review previous 
> Jira tickets under the current parent in order to understand how 
> *StringPredicate* expressions are currently used and tested in Spark:
>  * [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
>  * [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
>  * [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]
> These tickets should help you understand what changes were introduced in 
> order to enable collation support for these functions. Lastly, feel free to 
> use your chosen Spark SQL Editor to play around with the existing functions 
> and learn more about how they work.
>  
> The goal for this Jira ticket is to improve the UTF8_BINARY_LCASE 
> implementation for the {*}contains{*}, {*}startsWith{*}, and *endsWith* 
> functions so that they use the optimized lowercase comparison approach (following 
> the general logic in Nikola's PR), and benchmark the results accordingly. As 
> for testing, the currently existing unit test cases and end-to-end tests 
> should already fully cover the expected behaviour of *StringPredicate* 
> expressions for all collation types. In other words, the objective of this 
> ticket is only to enhance the internal implementation, without introducing 
> any user-facing changes to Spark SQL API.
>  
> Finally, feel free to refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
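
To make the optimization concrete, here is an illustrative sketch of the lowercase comparison idea (Spark's real implementation works on UTF8String bytes rather than java.lang.String):
{code:scala}
// Case-insensitive startsWith without allocating lowercased copies of either
// string; contains and endsWith follow the same pattern.
def startsWithLowercase(str: String, prefix: String): Boolean = {
  if (prefix.length > str.length) return false
  var i = 0
  while (i < prefix.length) {
    if (Character.toLowerCase(str.charAt(i)) !=
        Character.toLowerCase(prefix.charAt(i))) return false
    i += 1
  }
  true
}
{code}
Note that this char-level version glosses over one-to-many Unicode case mappings, which the real implementation has to handle at the code-point level.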



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47797) Skip deleting pod from k8s if the pod does not exist

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47797:
--
Affects Version/s: 4.0.0
   (was: 3.5.1)

> Skip deleting pod from k8s if the pod does not exist
> -
>
> Key: SPARK-47797
> URL: https://issues.apache.org/jira/browse/SPARK-47797
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In the ExecutorPodsLifecycleManager#removeExecutorFromK8s method, get the pod 
> before deleting it; we can skip the deletion if the pod is already deleted, so 
> that we do not send too many requests to the API server.
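
A hedged sketch of that guard, against a hypothetical client interface (the real code uses the fabric8 Kubernetes client inside ExecutorPodsLifecycleManager):
{code:scala}
// Hypothetical minimal pod client; only the calls needed for the sketch.
trait PodClient {
  def getPod(name: String): Option[AnyRef] // None if the pod no longer exists
  def deletePod(name: String): Unit
}

def removeExecutorFromK8s(client: PodClient, podName: String): Unit =
  client.getPod(podName) match {
    case Some(_) => client.deletePod(podName) // still present: delete it
    case None => () // already gone: skip the extra call to the API server
  }
{code}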



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47797) Skip deleting pod from k8s if the pod does not exist

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47797.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45979
[https://github.com/apache/spark/pull/45979]

> Skip deleting pod from k8s if the pod does not exist
> -
>
> Key: SPARK-47797
> URL: https://issues.apache.org/jira/browse/SPARK-47797
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.1
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In the ExecutorPodsLifecycleManager#removeExecutorFromK8s method, get the pod 
> before deleting it; we can skip the deletion if the pod is already deleted, so 
> that we do not send too many requests to the API server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47797) Skip deleting pod from k8s if the pod does not exist

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47797:
-

Assignee: leesf

> Skip deleting pod from k8s if the pod does not exist
> -
>
> Key: SPARK-47797
> URL: https://issues.apache.org/jira/browse/SPARK-47797
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.5.1
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
>
> In the ExecutorPodsLifecycleManager#removeExecutorFromK8s method, get the pod 
> before deleting it; we can skip the deletion if the pod is already deleted, so 
> that we do not send too many requests to the API server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47818) Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

2024-04-11 Thread Xi Lyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Lyu updated SPARK-47818:
---
Description: 
While building a DataFrame step by step, each transformation generates a new 
DataFrame with an empty schema, which is lazily computed on access. However, if 
a user's code frequently accesses the schema of these new DataFrames using 
methods such as `df.columns`, it will result in a large number of Analyze 
requests to the server. Each time, the entire plan needs to be reanalyzed, 
leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation whenever the resolved logical plan of a subtree can be 
cached.

  was:
While building a DataFrame step by step, each transformation generates a new 
DataFrame with an empty schema. However, if a user's code frequently accesses 
the schema of these new DataFrames using methods such as `df.columns`, it will 
result in a large number of Analyze requests to the server. Each time, the 
entire plan needs to be reanalyzed, leading to poor performance, especially 
when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation whenever the resolved logical plan of a subtree can be 
cached.


> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze 
> requests
> --
>
> Key: SPARK-47818
> URL: https://issues.apache.org/jira/browse/SPARK-47818
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Xi Lyu
>Priority: Major
> Fix For: 4.0.0
>
>
> While building a DataFrame step by step, each transformation generates a new 
> DataFrame with an empty schema, which is lazily computed on access. However, 
> if a user's code frequently accesses the schema of these new DataFrames using 
> methods such as `df.columns`, it will result in a large number of Analyze 
> requests to the server. Each time, the entire plan needs to be reanalyzed, 
> leading to poor performance, especially when constructing highly complex 
> plans.
> Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
> overhead of repeated analysis during this process. This is achieved by saving 
> significant computation whenever the resolved logical plan of a subtree can be 
> cached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47818) Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

2024-04-11 Thread Xi Lyu (Jira)
Xi Lyu created SPARK-47818:
--

 Summary: Introduce plan cache in SparkConnectPlanner to improve 
performance of Analyze requests
 Key: SPARK-47818
 URL: https://issues.apache.org/jira/browse/SPARK-47818
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Xi Lyu
 Fix For: 4.0.0


While building a DataFrame step by step, each transformation generates a new 
DataFrame with an empty schema. However, if a user's code frequently accesses 
the schema of these new DataFrames using methods such as `df.columns`, it will 
result in a large number of Analyze requests to the server. Each time, the 
entire plan needs to be reanalyzed, leading to poor performance, especially 
when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the 
overhead of repeated analysis during this process. This is achieved by saving 
significant computation whenever the resolved logical plan of a subtree can be 
cached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47418) Optimize string predicate expressions for UTF8_BINARY_LCASE collation

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47418:
-
Summary: Optimize string predicate expressions for UTF8_BINARY_LCASE 
collation  (was: TBD)

> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47817) Update pandas to 2.2.2

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47817:
-

Assignee: Bjørn Jørgensen

> Update pandas to 2.2.2
> --
>
> Key: SPARK-47817
> URL: https://issues.apache.org/jira/browse/SPARK-47817
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
>
> [Release notes|https://pandas.pydata.org/docs/whatsnew/v2.2.2.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47817) Update pandas to 2.2.2

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47817:
---
Labels: pull-request-available  (was: )

> Update pandas to 2.2.2
> --
>
> Key: SPARK-47817
> URL: https://issues.apache.org/jira/browse/SPARK-47817
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
>
> [Release notes|https://pandas.pydata.org/docs/whatsnew/v2.2.2.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47817) Update pandas to 2.2.2

2024-04-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47817.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46009
[https://github.com/apache/spark/pull/46009]

> Update pandas to 2.2.2
> --
>
> Key: SPARK-47817
> URL: https://issues.apache.org/jira/browse/SPARK-47817
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Pandas API on Spark
>Affects Versions: 4.0.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [Release notes|https://pandas.pydata.org/docs/whatsnew/v2.2.2.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47410:
---

Assignee: Uroš Bojanić

> refactor UTF8String and CollationFactory
> 
>
> Key: SPARK-47410
> URL: https://issues.apache.org/jira/browse/SPARK-47410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> This ticket addresses the need to refactor the {{UTF8String}} and 
> {{CollationFactory}} classes within Spark to enhance support for 
> collation-aware expressions. The goal is to improve code structure, 
> maintainability, readability, and testing coverage for collation-aware Spark 
> SQL expressions.
> The changes introduced herein should simplify the addition of new collation-aware 
> operations and ensure consistent testing across the codebase.
>  
> To further support the addition of collation support for new Spark 
> expressions, here are a couple of guidelines to follow:
>  
> // 1. Collation-aware expression implementation
> CollationSupport.java
>  * should serve as a static entry point for collation-aware expressions, 
> providing custom support
>  * for example: one by one Spark expression with corresponding collation 
> support
>  * also note that: CollationAwareUTF8String should be used for 
> collation-aware UTF8String operations & other utility methods
> CollationFactory.java
>  * should continue to serve as a static provider for high-level collation 
> interface
>  * for example: interacting with external ICU components such as Collator, 
> StringSearch, etc.
>  * also note that: no low-level / expression-specific code should generally 
> be found here
> UTF8String.java
>  * should be largely collation-unaware, and generally be used only as 
> storage, nothing else
>  * for example: don’t change this class at all (with the only one-time 
> exception of: semanticEquals/Compare)
>  * also note that: no collation-aware operation implementations (using 
> collationId) should be put here
> stringExpressions.scala / regexpExpressions.scala / other 
> “sql.catalyst.expressions” (for example: Between.scala)
>  * should only contain minimal changes in order to re-route collation-aware 
> implementations to CollationSupport
>  * for example: most changes should be in relation to: adding collationId, 
> using correct data types, replacements, etc.
>  * also note that: nullSafeEval & doGenCode should likely not introduce 
> extra branching based on collationId
>  
> // 2. Collation-aware expression testing
> CollationSuite.scala
>  * should be used for testing more general collation concepts
>  * for example: collate/collation expressions, collation names, DDL, casting, 
> aggregate, shuffle, join, etc.
>  * also note that: no extra tests should generally be added
> CollationSupportSuite.java
>  * should be used for expression unit tests, these tests should be as 
> rigorous as possible in order to cover various cases
>  * for example: unit tests that test collation-aware expression 
> implementation for various collations (binary, lowercase, ICU)
>  * also note that: these tests should generally be written after adding 
> appropriate expression support in CollationSupport.java
> CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala 
> / CollationExpressionSuite.scala
>  * should be used for expression end-to-end tests, these tests should only 
> cover crucial expression behaviour
>  * for example: SQL tests that verify query execution results, expected 
> return data types, casting, unsupported collation handling, etc.
>  * also note that: these tests should generally be written after enabling 
> appropriate expression support in stringExpressions.scala
>  
> // 3. Closing notes
>  * Carefully think about performance implications of newly added custom 
> collation-aware expression implementation
>  * for example: be very careful with extra string allocations (UTF8Strings -> 
> (Java) String -> UTF8Strings, etc.)
>  * also note that: some operations introduce very heavy performance penalties 
> (we should avoid the ones we can)
>  
>  * Make sure to test all newly added expressions completely (unit tests, 
> end-to-end tests, etc.)
>  * for example: consider edge cases, such as: empty strings, uppercase and 
> lowercase mix, different byte-length chars, etc.
>  * also note that: all similar tests should be uniform & readable and be kept 
> in one place for various expressions
>  
>  * Consider how new expressions interact with the rest of the system 
> (casting; collation support level - use correct AbstractStringType, etc.)
>  * for example: we should watch out for casting, test it 
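
As a loose illustration of the dispatch pattern these guidelines describe (a sketch with assumed names and ids, not the actual CollationSupport code):
{code:scala}
// Assumed collation ids, for the sketch only.
object CollationIds {
  val UTF8_BINARY = 0
  val UTF8_BINARY_LCASE = 1
}

// Static entry point: one method per expression, dispatching on collationId,
// so expression classes re-route here instead of branching themselves and
// UTF8String stays storage-only.
object CollationSupportSketch {
  import CollationIds._
  def execContains(left: String, right: String, collationId: Int): Boolean =
    collationId match {
      case UTF8_BINARY => left.contains(right)
      case UTF8_BINARY_LCASE => left.toLowerCase.contains(right.toLowerCase)
      case other => sys.error(s"unsupported collation id: $other")
    }
}
{code}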

[jira] [Resolved] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47410.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45978
[https://github.com/apache/spark/pull/45978]

> refactor UTF8String and CollationFactory
> 
>
> Key: SPARK-47410
> URL: https://issues.apache.org/jira/browse/SPARK-47410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This ticket addresses the need to refactor the {{UTF8String}} and 
> {{CollationFactory}} classes within Spark to enhance support for 
> collation-aware expressions. The goal is to improve code structure, 
> maintainability, readability, and testing coverage for collation-aware Spark 
> SQL expressions.
> The changes introduced herein should simplify the addition of new collation-aware 
> operations and ensure consistent testing across the codebase.
>  
> To further support the addition of collation support for new Spark 
> expressions, here are a couple of guidelines to follow:
>  
> // 1. Collation-aware expression implementation
> CollationSupport.java
>  * should serve as a static entry point for collation-aware expressions, 
> providing custom support
>  * for example: one by one Spark expression with corresponding collation 
> support
>  * also note that: CollationAwareUTF8String should be used for 
> collation-aware UTF8String operations & other utility methods
> CollationFactory.java
>  * should continue to serve as a static provider for high-level collation 
> interface
>  * for example: interacting with external ICU components such as Collator, 
> StringSearch, etc.
>  * also note that: no low-level / expression-specific code should generally 
> be found here
> UTF8String.java
>  * should be largely collation-unaware, and generally be used only as 
> storage, nothing else
>  * for example: don’t change this class at all (with the only one-time 
> exception of: semanticEquals/Compare)
>  * also note that: no collation-aware operation implementations (using 
> collationId) should be put here
> stringExpressions.scala / regexpExpressions.scala / other 
> “sql.catalyst.expressions” (for example: Between.scala)
>  * should only contain minimal changes in order to re-route collation-aware 
> implementations to CollationSupport
>  * for example: most changes should be in relation to: adding collationId, 
> using correct data types, replacements, etc.
>  * also note that: nullSafeEval & doGenCode should likely not introduce 
> extra branching based on collationId
>  
> // 2. Collation-aware expression testing
> CollationSuite.scala
>  * should be used for testing more general collation concepts
>  * for example: collate/collation expressions, collation names, DDL, casting, 
> aggregate, shuffle, join, etc.
>  * also note that: no extra tests should generally be added
> CollationSupportSuite.java
>  * should be used for expression unit tests, these tests should be as 
> rigorous as possible in order to cover various cases
>  * for example: unit tests that test collation-aware expression 
> implementation for various collations (binary, lowercase, ICU)
>  * also note that: these tests should generally be written after adding 
> appropriate expression support in CollationSupport.java
> CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala 
> / CollationExpressionSuite.scala
>  * should be used for expression end-to-end tests, these tests should only 
> cover crucial expression behaviour
>  * for example: SQL tests that verify query execution results, expected 
> return data types, casting, unsupported collation handling, etc.
>  * also note that: these tests should generally be written after enabling 
> appropriate expression support in stringExpressions.scala
>  
> // 3. Closing notes
>  * Carefully think about performance implications of newly added custom 
> collation-aware expression implementation
>  * for example: be very careful with extra string allocations (UTF8Strings -> 
> (Java) String -> UTF8Strings, etc.)
>  * also note that: some operations introduce very heavy performance penalties 
> (we should avoid the ones we can)
>  
>  * Make sure to test all newly added expressions completely (unit tests, 
> end-to-end tests, etc.)
>  * for example: consider edge cases, such as: empty strings, uppercase and 
> lowercase mix, different byte-length chars, etc.
>  * also note that: all similar tests should be uniform & readable and be kept 
> in one place for various expressions
>  
>  * Consider how new expressions interact with the rest of the system 
> (casting; 

[jira] [Resolved] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-04-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47617.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45739
[https://github.com/apache/spark/pull/45739]

> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> As collation support grows across all SQL features and new collation types 
> are added, we need a reliable testing model covering as many standard 
> SQL capabilities as possible.
> We can utilize TPC-DS testing infrastructure already present in Spark. The 
> idea is to vary TPC-DS table string columns by adding multiple collations 
> with different ordering rules and case sensitivity, producing new tables. 
> These tables should yield the same results against predefined TPC-DS queries 
> for certain batches of collations. For example, when comparing query runs on 
> tables where columns are first collated as UTF8_BINARY and then as 
> UTF8_BINARY_LCASE, we should get the same results after converting to 
> lowercase.
> Introduce new query suite which tests the described behavior with available 
> collations (utf8_binary and unicode) combined with case conversions 
> (lowercase, uppercase, randomized case for fuzzy testing).
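
A minimal sketch of the comparison such a suite performs, assuming a hypothetical runQuery helper that returns result rows as strings:
{code:scala}
// Run the same TPC-DS query against two collation variants of the tables and
// compare the results after lowercasing, as described above.
def resultsAgree(runQuery: String => Seq[String],
                 binaryVariant: String,
                 lcaseVariant: String): Boolean = {
  val a = runQuery(binaryVariant).map(_.toLowerCase).sorted
  val b = runQuery(lcaseVariant).map(_.toLowerCase).sorted
  a == b
}
{code}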



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-04-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47617:
---

Assignee: Nikola Mandic

> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
>
> As collation support grows across all SQL features and new collation types 
> are added, we need a reliable testing model covering as many standard 
> SQL capabilities as possible.
> We can utilize TPC-DS testing infrastructure already present in Spark. The 
> idea is to vary TPC-DS table string columns by adding multiple collations 
> with different ordering rules and case sensitivity, producing new tables. 
> These tables should yield the same results against predefined TPC-DS queries 
> for certain batches of collations. For example, when comparing query runs on 
> tables where columns are first collated as UTF8_BINARY and then as 
> UTF8_BINARY_LCASE, we should get the same results after converting to 
> lowercase.
> Introduce new query suite which tests the described behavior with available 
> collations (utf8_binary and unicode) combined with case conversions 
> (lowercase, uppercase, randomized case for fuzzy testing).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47357) Add support for Upper, Lower, InitCap (all collations)

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47357:
---
Labels: pull-request-available  (was: )

> Add support for Upper, Lower, InitCap (all collations)
> --
>
> Key: SPARK-47357
> URL: https://issues.apache.org/jira/browse/SPARK-47357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47357) Add support for Upper, Lower, InitCap (all collations)

2024-04-11 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47357:
--
Summary: Add support for Upper, Lower, InitCap (all collations)  (was: 
Upper, Lower, InitCap (all collations))

> Add support for Upper, Lower, InitCap (all collations)
> --
>
> Key: SPARK-47357
> URL: https://issues.apache.org/jira/browse/SPARK-47357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47815) Unify the user agent with json

2024-04-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-47815.
---
Resolution: Not A Problem

> Unify the user agent with json
> --
>
> Key: SPARK-47815
> URL: https://issues.apache.org/jira/browse/SPARK-47815
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47816) Document the lazy evaluation of views in spark.{sql, table}

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47816:
---
Labels: pull-request-available  (was: )

> Document the lazy evaluation of views in spark.{sql, table}
> ---
>
> Key: SPARK-47816
> URL: https://issues.apache.org/jira/browse/SPARK-47816
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47816) Document the lazy evaluation of views in spark.{sql, table}

2024-04-11 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47816:
-

 Summary: Document the lazy evaluation of views in spark.{sql, 
table}
 Key: SPARK-47816
 URL: https://issues.apache.org/jira/browse/SPARK-47816
 Project: Spark
  Issue Type: Improvement
  Components: Connect, Documentation
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47809) checkExceptionInExpression should check error for each codegen mode

2024-04-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47809.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45997
[https://github.com/apache/spark/pull/45997]

> checkExceptionInExpression should check error for each codegen mode
> ---
>
> Key: SPARK-47809
> URL: https://issues.apache.org/jira/browse/SPARK-47809
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47809) checkExceptionInExpression should check error for each codegen mode

2024-04-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47809:


Assignee: Wenchen Fan

> checkExceptionInExpression should check error for each codegen mode
> ---
>
> Key: SPARK-47809
> URL: https://issues.apache.org/jira/browse/SPARK-47809
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47813) Replace getArrayDimension with updateExtraColumnMeta

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47813:
---
Labels: pull-request-available  (was: )

> Replace getArrayDimension with updateExtraColumnMeta 
> -
>
> Key: SPARK-47813
> URL: https://issues.apache.org/jira/browse/SPARK-47813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47357) Upper, Lower, InitCap (all collations)

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47357:
-
Summary: Upper, Lower, InitCap (all collations)  (was: Upper, Lower, 
InitCap (binary & lowercase collation only))

> Upper, Lower, InitCap (all collations)
> --
>
> Key: SPARK-47357
> URL: https://issues.apache.org/jira/browse/SPARK-47357
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47356) ConcatWs & Elt (all collations)

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47356:
-
Summary: ConcatWs & Elt (all collations)  (was: ConcatWs & Elt (binary & 
lowercase collation only))

> ConcatWs & Elt (all collations)
> ---
>
> Key: SPARK-47356
> URL: https://issues.apache.org/jira/browse/SPARK-47356
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47814) Move the `KinesisTestUtils` from `main` to `test`

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47814:
---
Labels: pull-request-available  (was: )

> Move the `KinesisTestUtils` from `main` to `test`
> ---
>
> Key: SPARK-47814
> URL: https://issues.apache.org/jira/browse/SPARK-47814
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47815) Unify the user agent with json

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47815:
---
Labels: pull-request-available  (was: )

> Unify the user agent with json
> --
>
> Key: SPARK-47815
> URL: https://issues.apache.org/jira/browse/SPARK-47815
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47815) Unify the user agent string with json

2024-04-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-47815:
--
Summary: Unify the user agent string with json  (was: Unify the user agent 
string representation with json)

> Unify the user agent string with json
> -
>
> Key: SPARK-47815
> URL: https://issues.apache.org/jira/browse/SPARK-47815
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47815) Unify the user agent string representation with json

2024-04-11 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47815:
-

 Summary: Unify the user agent string representation with json
 Key: SPARK-47815
 URL: https://issues.apache.org/jira/browse/SPARK-47815
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47815) Unify the user agent with json

2024-04-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-47815:
--
Summary: Unify the user agent with json  (was: Unify the user agent string 
with json)

> Unify the user agent with json
> --
>
> Key: SPARK-47815
> URL: https://issues.apache.org/jira/browse/SPARK-47815
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47625) Addition of Indeterminate Collation Support

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47625:
---
Labels: pull-request-available  (was: )

> Addition of Indeterminate Collation Support
> ---
>
> Key: SPARK-47625
> URL: https://issues.apache.org/jira/browse/SPARK-47625
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>
> {{INDETERMINATE_COLLATION}} should only be thrown for comparison operations 
> and for storing data in memory; we should be able to combine different 
> implicit collations for certain operations, such as concat, and possibly 
> others in the future.
> This is why we have to add another predefined collation id, named 
> {{INDETERMINATE_COLLATION_ID}}, which means that the result is a combination 
> of conflicting non-default implicit collations. For now it would have an id 
> of -1, so it fails if it ever reaches the {{CollatorFactory}}.
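For illustration, here is a minimal sketch of the combination rule described above. The sentinel value (-1) mirrors the ticket, but the resolution logic and the default-collation id are assumptions, not the actual Spark implementation:

{code:scala}
// Hypothetical sketch only -- not the merged Spark change.
object CollationResolution {
  val DefaultCollationId = 0         // assumed id for UTF8_BINARY
  val IndeterminateCollationId = -1  // sentinel value from this ticket

  // Combine the implicit collations of two inputs (e.g. for concat):
  // a default collation yields to the other side, while two conflicting
  // non-default implicit collations become indeterminate.
  def resolve(left: Int, right: Int): Int =
    if (left == right) left
    else if (left == DefaultCollationId) right
    else if (right == DefaultCollationId) left
    else IndeterminateCollationId

  // Comparisons must reject indeterminate collation, per the ticket.
  def checkComparable(collationId: Int): Unit =
    require(collationId != IndeterminateCollationId,
      "INDETERMINATE_COLLATION: conflicting implicit collations")
}
{code}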



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47412) StringLPad, StringRPad (all collations)

2024-04-11 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-47412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836100#comment-17836100
 ] 

Uroš Bojanić commented on SPARK-47412:
--

[~gpgp] Yup, you got it! That's the expected behaviour, very similar to 
substring/left/right.

> StringLPad, StringRPad (all collations)
> ---
>
> Key: SPARK-47412
> URL: https://issues.apache.org/jira/browse/SPARK-47412
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Enable collation support for the *StringLPad* & *StringRPad* built-in string 
> functions in Spark. First confirm the expected behaviour of these functions 
> when given collated strings, then move on to an implementation that handles 
> strings of all collation types. Implement the corresponding unit tests 
> (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect 
> how these functions should be used with collation in SparkSQL, and feel free 
> to use your chosen Spark SQL editor to experiment with the existing 
> functions to learn more about how they work. In addition, look into the 
> possible use-cases and implementations of similar functions in other 
> open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal of this Jira ticket is to implement the *StringLPad* & *StringRPad* 
> functions so that they support all collation types currently supported in 
> Spark. To understand what changes were introduced to enable full collation 
> support for other existing functions in Spark, take a look at the Spark PRs 
> and Jira tickets for completed tasks under this parent (for example: 
> Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and the 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
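As a quick illustration of the expected behaviour discussed in the comment above, here is a sketch of how lpad might be exercised against a collated string. The COLLATE syntax and the collation name follow the ongoing collation effort, but this is illustrative only, not a test from the actual suite:

{code:scala}
// Illustrative sketch only. Padding adds characters and performs no
// comparisons, so the result should not depend on the input's collation.
import org.apache.spark.sql.SparkSession

object LPadCollationDemo extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Assumed syntax: expr COLLATE collationName (per the collation effort).
  spark.sql("SELECT lpad('abc' COLLATE UTF8_BINARY_LCASE, 5, 'x')").show()
  // expected single value: xxabc

  spark.stop()
}
{code}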



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47408) TBD

2024-04-11 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47408:
--
Summary: TBD  (was: Luhncheck (all collations))

> TBD
> ---
>
> Key: SPARK-47408
> URL: https://issues.apache.org/jira/browse/SPARK-47408
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47414) TBD

2024-04-11 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47414:
--
Summary: TBD  (was: Length, BitLength, OctetLength (all collations))

> TBD
> ---
>
> Key: SPARK-47414
> URL: https://issues.apache.org/jira/browse/SPARK-47414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47416) TBD

2024-04-11 Thread Nikola Mandic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47416:
--
Summary: TBD  (was: SoundEx (all collations))

> TBD
> ---
>
> Key: SPARK-47416
> URL: https://issues.apache.org/jira/browse/SPARK-47416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47814) Remove the `KinesisTestUtils` from `main` to `test`

2024-04-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836079#comment-17836079
 ] 

ASF GitHub Bot commented on SPARK-47814:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/46000

> Remove the `KinesisTestUtils` from `main` to `test`
> ---
>
> Key: SPARK-47814
> URL: https://issues.apache.org/jira/browse/SPARK-47814
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47814) Remove the `KinesisTestUtils` from `main` to `test`

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47814:
--

Assignee: Apache Spark

> Remove the `KinesisTestUtils` from `main` to `test`
> ---
>
> Key: SPARK-47814
> URL: https://issues.apache.org/jira/browse/SPARK-47814
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47814) Remove the `KinesisTestUtils` from `main` to `test`

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47814:
--

Assignee: (was: Apache Spark)

> Remove the `KinesisTestUtils` from `main` to `test`
> ---
>
> Key: SPARK-47814
> URL: https://issues.apache.org/jira/browse/SPARK-47814
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47814) Remove the `KinesisTestUtils` from `main` to `test`

2024-04-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47814:
---

 Summary: Remove the `KinesisTestUtils` from `main` to `test`
 Key: SPARK-47814
 URL: https://issues.apache.org/jira/browse/SPARK-47814
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47617:
--

Assignee: (was: Apache Spark)

> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Priority: Major
>  Labels: pull-request-available
>
> As collation support grows across all SQL features and new collation types 
> are added, we need a reliable testing model covering as many standard SQL 
> capabilities as possible.
> We can utilize the TPC-DS testing infrastructure already present in Spark. 
> The idea is to vary the TPC-DS tables' string columns by applying multiple 
> collations with different ordering rules and case sensitivity, producing 
> new tables. These tables should yield the same results for predefined 
> TPC-DS queries within certain batches of collations. For example, when 
> comparing query runs on a table whose columns are first collated as 
> UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results 
> after converting to lowercase.
> Introduce a new query suite which tests the described behavior with the 
> available collations (utf8_binary and unicode) combined with case 
> conversions (lowercase, uppercase, randomized case for fuzzy testing).
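A minimal sketch of the comparison step such a suite could perform is shown below. The table names and the $table placeholder are hypothetical, and the actual suite may structure this differently:

{code:scala}
// Hypothetical sketch of the lowercase-comparison check, not the real suite.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lower

object CollationTpcdsCheck {
  // Lowercase every string column so runs over differently collated
  // tables can be compared for equality.
  def lowercased(df: DataFrame): DataFrame =
    df.schema.fields.filter(_.dataType.typeName == "string")
      .foldLeft(df)((d, f) => d.withColumn(f.name, lower(d.col(f.name))))

  // Run the same TPC-DS query against a UTF8_BINARY table and its
  // UTF8_BINARY_LCASE variant; assert equal results modulo case.
  def assertSameModuloCase(spark: SparkSession, query: String): Unit = {
    val binary = spark.sql(query.replace("$table", "item_utf8_binary"))
    val lcase  = spark.sql(query.replace("$table", "item_utf8_binary_lcase"))
    assert(lowercased(binary).exceptAll(lowercased(lcase)).isEmpty &&
           lowercased(lcase).exceptAll(lowercased(binary)).isEmpty)
  }
}
{code}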



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47617) Add TPC-DS testing infrastructure for collations

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47617:
--

Assignee: Apache Spark

> Add TPC-DS testing infrastructure for collations
> 
>
> Key: SPARK-47617
> URL: https://issues.apache.org/jira/browse/SPARK-47617
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Nikola Mandic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> As collation support grows across all SQL features and new collation types 
> are added, we need a reliable testing model covering as many standard SQL 
> capabilities as possible.
> We can utilize the TPC-DS testing infrastructure already present in Spark. 
> The idea is to vary the TPC-DS tables' string columns by applying multiple 
> collations with different ordering rules and case sensitivity, producing 
> new tables. These tables should yield the same results for predefined 
> TPC-DS queries within certain batches of collations. For example, when 
> comparing query runs on a table whose columns are first collated as 
> UTF8_BINARY and then as UTF8_BINARY_LCASE, we should get the same results 
> after converting to lowercase.
> Introduce a new query suite which tests the described behavior with the 
> available collations (utf8_binary and unicode) combined with case 
> conversions (lowercase, uppercase, randomized case for fuzzy testing).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47800) Add method for converting v2 identifier to table identifier

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47800:
--

Assignee: Apache Spark

> Add method for converting v2 identifier to table identifier
> ---
>
> Key: SPARK-47800
> URL: https://issues.apache.org/jira/browse/SPARK-47800
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>
> Move the conversion of a v2 identifier object to a v1 table identifier into a new method.
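One possible shape for such a helper is sketched below; the method name and error handling are assumptions, not the actual change from the associated PR:

{code:scala}
// Hypothetical sketch only. Converts a catalog v2 Identifier into a v1
// TableIdentifier; multi-part namespaces have no v1 equivalent.
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.connector.catalog.Identifier

object IdentifierConversion {
  def asTableIdentifier(ident: Identifier): TableIdentifier =
    ident.namespace() match {
      case Array(db) => TableIdentifier(ident.name(), Some(db))
      case Array()   => TableIdentifier(ident.name())
      case ns        => throw new IllegalArgumentException(
        s"Cannot convert namespace ${ns.mkString(".")} to a v1 identifier")
    }
}
{code}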



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47800) Add method for converting v2 identifier to table identifier

2024-04-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47800:
--

Assignee: (was: Apache Spark)

> Add method for converting v2 identifier to table identifier
> ---
>
> Key: SPARK-47800
> URL: https://issues.apache.org/jira/browse/SPARK-47800
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Uros Stankovic
>Priority: Minor
>  Labels: pull-request-available
>
> Move the conversion of a v2 identifier object to a v1 table identifier into a new method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47813) Replace getArrayDimension with updateExtraColumnMeta

2024-04-11 Thread Kent Yao (Jira)
Kent Yao created SPARK-47813:


 Summary: Replace getArrayDimension with updateExtraColumnMeta 
 Key: SPARK-47813
 URL: https://issues.apache.org/jira/browse/SPARK-47813
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


