[jira] [Resolved] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect

2024-05-19 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-48335.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46657
[https://github.com/apache/spark/pull/46657]

> Make `_parse_datatype_string` compatible with Spark Connect
> ---
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect

2024-05-19 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-48335:
-

Assignee: Ruifeng Zheng

> Make `_parse_datatype_string` compatible with Spark Connect
> ---
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>







[jira] [Updated] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect

2024-05-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48335:
---
Labels: pull-request-available  (was: )

> Make `_parse_datatype_string` compatible with Spark Connect
> ---
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>







[jira] [Created] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect

2024-05-19 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48335:
-

Summary: Make `_parse_datatype_string` compatible with Spark Connect
Key: SPARK-48335
URL: https://issues.apache.org/jira/browse/SPARK-48335
Project: Spark
Issue Type: Bug
Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng









[jira] [Updated] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect

2024-05-19 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-48335:
--
Issue Type: Improvement  (was: Bug)

> Make `_parse_datatype_string` compatible with Spark Connect
> ---
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
>







[jira] [Resolved] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`

2024-05-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48333.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46654
[https://github.com/apache/spark/pull/46654]

> Test `test_sorting_functions_with_column` with same `Column`
> 
>
> Key: SPARK-48333
> URL: https://issues.apache.org/jira/browse/SPARK-48333
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`

2024-05-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48333:


Assignee: Ruifeng Zheng

> Test `test_sorting_functions_with_column` with same `Column`
> 
>
> Key: SPARK-48333
> URL: https://issues.apache.org/jira/browse/SPARK-48333
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>







[jira] [Created] (SPARK-48334) NettyServer doesn't shutdown if SparkContext initialize failed

2024-05-19 Thread IsisPolei (Jira)
IsisPolei created SPARK-48334:
-

Summary: NettyServer doesn't shutdown if SparkContext initialize failed
Key: SPARK-48334
URL: https://issues.apache.org/jira/browse/SPARK-48334
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.3
Reporter: IsisPolei


When a SparkContext instance is obtained through SparkContext.getOrCreate() and 
an exception occurs during initialization (for example, because of an invalid 
Spark parameter such as spark.executor.memory=1 without a unit), the RpcServer 
started during initialization is not shut down, so its port remains occupied 
indefinitely.

The RpcServer is closed in _env.stop(), which calls rpcEnv.shutdown(), but 
this happens only when _env != null (SparkContext.scala:2106, version 3.1.3). 
Because the error occurs during initialization, _env has not been instantiated, 
so _env.stop() is never executed and the RpcServer is never closed.
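
The failure mode described above can be sketched with a toy model. This is only an illustration of the ordering problem, not Spark code: `ToyRpcServer`, `parse_memory`, `create_context`, `OPEN_PORTS`, and the `release_on_failure` flag are all hypothetical names, and the validation rule is a simplified stand-in for the unit check mentioned in the report.

```python
OPEN_PORTS = set()  # stands in for ports currently held by the process


class ToyRpcServer:
    """Hypothetical server that occupies a port until shutdown() is called."""

    def __init__(self, port):
        self.port = port
        OPEN_PORTS.add(port)

    def shutdown(self):
        OPEN_PORTS.discard(self.port)


def parse_memory(value):
    """Reject a size without a unit suffix, e.g. "1" instead of "1g"."""
    if not value or not value[-1].isalpha():
        raise ValueError(f"memory size needs a unit: {value!r}")
    return value


def create_context(port, executor_memory, release_on_failure):
    """Start the server first, then validate settings (assumed ordering)."""
    server = ToyRpcServer(port)  # the port is taken from here on
    env = None  # would only be assigned after successful validation
    try:
        parse_memory(executor_memory)
        env = {"rpc_server": server}
        return env
    except Exception:
        # Cleanup that is keyed only on `env` never runs on this path,
        # because `env` is still None. Releasing the server directly
        # (release_on_failure=True) closes that gap.
        if env is not None or release_on_failure:
            server.shutdown()
        raise
```

Calling `create_context(7077, "1", release_on_failure=False)` raises `ValueError` and leaves port 7077 in `OPEN_PORTS`, which mirrors the reported leak; with `release_on_failure=True` the same failure releases the port before the exception propagates.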






[jira] [Assigned] (SPARK-44924) Add configurations for FileStreamSource cached files

2024-05-19 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-44924:


Assignee: kevin nacios

> Add configurations for FileStreamSource cached files
> 
>
> Key: SPARK-44924
> URL: https://issues.apache.org/jira/browse/SPARK-44924
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: kevin nacios
> Assignee: kevin nacios
> Priority: Minor
> Labels: pull-request-available
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed 
> files was added for structured streaming to reduce the cost of relisting from 
> the filesystem on each batch.  The settings that drive this are currently 
> hardcoded and there is no way to change them.  
>  
> This impacts some of our workloads that process large datasets in which it is 
> unknown how "heavy" individual files are, so a single batch can take a long 
> time.  When we set maxFilesPerTrigger to 100k files, a subsequent batch that 
> uses the cached maximum of 10k files makes the job take longer, since the 
> cluster can handle 100k files but is stuck doing 10% of that workload.  The 
> benefit of the caching doesn't outweigh its performance cost to the rest of 
> the job.
>  
> With config settings available for this, we could either absorb some 
> increased driver memory usage for caching the next 100k files, or opt to 
> disable caching entirely and just relist files each batch by setting the 
> cache amount to 0.
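
The batch-size arithmetic in the description above can be illustrated with a small model. This is a simplified sketch under stated assumptions, not Spark's FileStreamSource logic: it assumes each fresh listing serves up to `max_files_per_trigger` files and caches up to `max_cached_files` of the leftover, and that a batch which finds a non-empty cache is served from the cache alone. The function and parameter names are hypothetical.

```python
def plan_batches(total_files, max_files_per_trigger, max_cached_files):
    """Return the size of each micro-batch under the simplified cache model.

    Assumes max_files_per_trigger is positive.
    """
    remaining = total_files  # files not yet processed
    cached = 0               # files held over from the last listing
    sizes = []
    while remaining > 0:
        if cached > 0:
            # Serve this batch from the cache only, without relisting.
            batch = min(cached, max_files_per_trigger)
            cached -= batch
        else:
            # Fresh listing: fill the batch, then cache part of the leftover.
            batch = min(remaining, max_files_per_trigger)
            cached = min(remaining - batch, max_cached_files)
        remaining -= batch
        sizes.append(batch)
    return sizes
```

In this model, 300k pending files with a 100k trigger and a 10k cache give batches of 100k, 10k, 100k, 10k, and 80k, so every second batch runs at 10% of the size the cluster could handle. Raising the cache limit to 100k, or setting it to 0 so that every batch relists, gives three full batches of 100k.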






[jira] [Resolved] (SPARK-44924) Add configurations for FileStreamSource cached files

2024-05-19 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-44924.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45362
[https://github.com/apache/spark/pull/45362]

> Add configurations for FileStreamSource cached files
> 
>
> Key: SPARK-44924
> URL: https://issues.apache.org/jira/browse/SPARK-44924
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: kevin nacios
> Assignee: kevin nacios
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed 
> files was added for structured streaming to reduce the cost of relisting from 
> the filesystem on each batch.  The settings that drive this are currently 
> hardcoded and there is no way to change them.  
>  
> This impacts some of our workloads that process large datasets in which it is 
> unknown how "heavy" individual files are, so a single batch can take a long 
> time.  When we set maxFilesPerTrigger to 100k files, a subsequent batch that 
> uses the cached maximum of 10k files makes the job take longer, since the 
> cluster can handle 100k files but is stuck doing 10% of that workload.  The 
> benefit of the caching doesn't outweigh its performance cost to the rest of 
> the job.
>  
> With config settings available for this, we could either absorb some 
> increased driver memory usage for caching the next 100k files, or opt to 
> disable caching entirely and just relist files each batch by setting the 
> cache amount to 0.






[jira] [Updated] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`

2024-05-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48333:
---
Labels: pull-request-available  (was: )

> Test `test_sorting_functions_with_column` with same `Column`
> 
>
> Key: SPARK-48333
> URL: https://issues.apache.org/jira/browse/SPARK-48333
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
>







[jira] [Created] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`

2024-05-19 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-48333:
-

Summary: Test `test_sorting_functions_with_column` with same `Column`
Key: SPARK-48333
URL: https://issues.apache.org/jira/browse/SPARK-48333
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng





