[jira] [Resolved] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-48335.
-----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46657
[https://github.com/apache/spark/pull/46657]

> Make `_parse_datatype_string` compatible with Spark Connect
> -----------------------------------------------------------
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-48335:
-------------------------------------
Assignee: Ruifeng Zheng

> Make `_parse_datatype_string` compatible with Spark Connect
> -----------------------------------------------------------
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48335:
-----------------------------------
Labels: pull-request-available (was: )

> Make `_parse_datatype_string` compatible with Spark Connect
> -----------------------------------------------------------
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect
Ruifeng Zheng created SPARK-48335:
----------------------------------
Summary: Make `_parse_datatype_string` compatible with Spark Connect
Key: SPARK-48335
URL: https://issues.apache.org/jira/browse/SPARK-48335
Project: Spark
Issue Type: Bug
Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-48335) Make `_parse_datatype_string` compatible with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-48335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-48335:
----------------------------------
Issue Type: Improvement (was: Bug)

> Make `_parse_datatype_string` compatible with Spark Connect
> -----------------------------------------------------------
>
> Key: SPARK-48335
> URL: https://issues.apache.org/jira/browse/SPARK-48335
> Project: Spark
> Issue Type: Improvement
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
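For context on what SPARK-48335 is about: PySpark's internal `_parse_datatype_string` turns a DDL schema string such as "a INT, b STRING" into a DataType, and under Spark Connect there is no local JVM parser to delegate to. The toy sketch below is purely illustrative of what such a parser does conceptually; the function name `parse_ddl_schema` and its flat, non-nested grammar are assumptions, not Spark's actual implementation.

```python
# Toy sketch of DDL schema-string parsing (hypothetical, NOT Spark's code).
# Real _parse_datatype_string supports nested types (struct, array, map),
# which this flat parser deliberately does not.

def parse_ddl_schema(ddl: str) -> list[tuple[str, str]]:
    """Parse a flat DDL schema like 'a INT, b STRING' into (name, type) pairs."""
    fields = []
    for field in ddl.split(","):
        # each field is "<name> <type>", e.g. "a INT"
        name, _, dtype = field.strip().partition(" ")
        if not name or not dtype.strip():
            raise ValueError(f"cannot parse field: {field!r}")
        fields.append((name, dtype.strip().lower()))
    return fields
```

For example, `parse_ddl_schema("a INT, b STRING")` yields `[("a", "int"), ("b", "string")]`; making the real parser run client-side (without a JVM round trip) is what Connect compatibility requires.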
[jira] [Resolved] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`
[ https://issues.apache.org/jira/browse/SPARK-48333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48333.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 46654
[https://github.com/apache/spark/pull/46654]

> Test `test_sorting_functions_with_column` with same `Column`
> ------------------------------------------------------------
>
> Key: SPARK-48333
> URL: https://issues.apache.org/jira/browse/SPARK-48333
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`
[ https://issues.apache.org/jira/browse/SPARK-48333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-48333:
------------------------------------
Assignee: Ruifeng Zheng

> Test `test_sorting_functions_with_column` with same `Column`
> ------------------------------------------------------------
>
> Key: SPARK-48333
> URL: https://issues.apache.org/jira/browse/SPARK-48333
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-48334) NettyServer doesn't shutdown if SparkContext initialize failed
IsisPolei created SPARK-48334:
------------------------------
Summary: NettyServer doesn't shutdown if SparkContext initialize failed
Key: SPARK-48334
URL: https://issues.apache.org/jira/browse/SPARK-48334
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.3
Reporter: IsisPolei

When obtaining a SparkContext instance via SparkContext.getOrCreate(), if an exception occurs during initialization (for example, an invalid Spark parameter such as spark.executor.memory=1 without units), the RpcServer started during this period is never shut down, leaving its port occupied indefinitely. The RpcServer is closed in _env.stop(), which executes rpcEnv.shutdown(), but this only happens when _env != null (SparkContext.scala:2106, version 3.1.3). Because the error occurs during initialization, _env is never instantiated, so _env.stop() is not executed and the RpcServer is never closed.
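The leak described above is the general partial-initialization pattern: a resource (here, the Netty RpcServer's port) is acquired before the cleanup path is reachable, so a failure in between leaks it. The Python sketch below is purely illustrative of that pattern and its usual remedy (release partial state on the failure path); the class `MiniServer` is hypothetical and is not Spark code.

```python
import socket

class MiniServer:
    """Illustrates the leak pattern from the report: a port is bound early
    in __init__, so if construction fails afterwards, the port would stay
    occupied unless the failure path explicitly releases it."""

    def __init__(self, fail: bool):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # bind an ephemeral port, analogous to the RpcServer binding its port
        self._sock.bind(("127.0.0.1", 0))
        try:
            if fail:
                # stands in for SparkContext init failing after server start
                raise ValueError("bad config, e.g. spark.executor.memory=1")
        except Exception:
            # the fix: release the partially-initialized resource, then re-raise
            self._sock.close()
            raise

    def port(self) -> int:
        return self._sock.getsockname()[1]

    def stop(self) -> None:
        self._sock.close()
```

The reported bug corresponds to omitting the try/except: cleanup lived only in a stop() method that is never reached when construction itself throws.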
[jira] [Assigned] (SPARK-44924) Add configurations for FileStreamSource cached files
[ https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned SPARK-44924:
------------------------------------
Assignee: kevin nacios

> Add configurations for FileStreamSource cached files
> ----------------------------------------------------
>
> Key: SPARK-44924
> URL: https://issues.apache.org/jira/browse/SPARK-44924
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: kevin nacios
> Assignee: kevin nacios
> Priority: Minor
> Labels: pull-request-available
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed files was added to Structured Streaming to reduce the cost of relisting files from the filesystem each batch. The settings that drive this are currently hardcoded, and there is no way to change them.
>
> This impacts some of our workloads where we process large datasets and it's unknown how "heavy" individual files are, so a single batch can take a long time. When we set maxFilesPerTrigger to 100k files, a subsequent batch using the cached maximum of 10k files makes the job take longer, since the cluster is capable of handling 100k files but is stuck doing 10% of the workload. The benefit of the caching doesn't outweigh its performance cost on the rest of the job.
>
> With config settings available for this, we could either absorb some increased driver memory usage by caching the next 100k files, or opt to disable caching entirely and just relist files each batch by setting the cache amount to 0.
[jira] [Resolved] (SPARK-44924) Add configurations for FileStreamSource cached files
[ https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-44924.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 45362
[https://github.com/apache/spark/pull/45362]

> Add configurations for FileStreamSource cached files
> ----------------------------------------------------
>
> Key: SPARK-44924
> URL: https://issues.apache.org/jira/browse/SPARK-44924
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: kevin nacios
> Assignee: kevin nacios
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed files was added to Structured Streaming to reduce the cost of relisting files from the filesystem each batch. The settings that drive this are currently hardcoded, and there is no way to change them.
>
> This impacts some of our workloads where we process large datasets and it's unknown how "heavy" individual files are, so a single batch can take a long time. When we set maxFilesPerTrigger to 100k files, a subsequent batch using the cached maximum of 10k files makes the job take longer, since the cluster is capable of handling 100k files but is stuck doing 10% of the workload. The benefit of the caching doesn't outweigh its performance cost on the rest of the job.
>
> With config settings available for this, we could either absorb some increased driver memory usage by caching the next 100k files, or opt to disable caching entirely and just relist files each batch by setting the cache amount to 0.
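The behavior the SPARK-44924 report asks to make configurable can be sketched as a bounded listing cache: each batch takes up to maxFilesPerTrigger files, and leftover listed files are cached up to some limit, with 0 meaning "always relist". The class and parameter names below (`FileListingCache`, `max_cached`) are hypothetical, chosen for illustration; this is a toy model, not FileStreamSource's implementation.

```python
class FileListingCache:
    """Toy model of a configurable listing cache (hypothetical API).
    max_cached=0 disables caching, so every batch triggers a fresh relist."""

    def __init__(self, list_files, max_cached: int):
        self._list_files = list_files      # the expensive filesystem scan
        self._max_cached = max_cached      # hardcoded in the reported versions
        self._cached: list[str] = []

    def next_batch(self, max_files: int) -> list[str]:
        if not self._cached:
            # relist, serve the first max_files, cache leftovers up to the limit
            listed = self._list_files()
            batch, rest = listed[:max_files], listed[max_files:]
            self._cached = rest[: self._max_cached]
            return batch
        # serve from cache without touching the filesystem
        batch, self._cached = self._cached[:max_files], self._cached[max_files:]
        return batch
```

This makes the reported pathology visible: with a large max_files but a small cache limit, cache-served batches are much smaller than relisted ones, which is why the report wants the limit raised or set to 0.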
[jira] [Updated] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`
[ https://issues.apache.org/jira/browse/SPARK-48333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-48333:
-----------------------------------
Labels: pull-request-available (was: )

> Test `test_sorting_functions_with_column` with same `Column`
> ------------------------------------------------------------
>
> Key: SPARK-48333
> URL: https://issues.apache.org/jira/browse/SPARK-48333
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark, Tests
> Affects Versions: 4.0.0
> Reporter: Ruifeng Zheng
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-48333) Test `test_sorting_functions_with_column` with same `Column`
Ruifeng Zheng created SPARK-48333:
----------------------------------
Summary: Test `test_sorting_functions_with_column` with same `Column`
Key: SPARK-48333
URL: https://issues.apache.org/jira/browse/SPARK-48333
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark, Tests
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng