[jira] [Updated] (SPARK-47458) Incorrect to calculate the concurrent task number
[ https://issues.apache.org/jira/browse/SPARK-47458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bobby Wang updated SPARK-47458:
-------------------------------
    Summary: Incorrect to calculate the concurrent task number  (was: Wrong to calculate the concurrent task number)

> Incorrect to calculate the concurrent task number
> -------------------------------------------------
>
> Key: SPARK-47458
> URL: https://issues.apache.org/jira/browse/SPARK-47458
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Bobby Wang
> Priority: Major
>
> The test case below fails:
>
> {code:java}
> test("problem of calculating the maximum concurrent task") {
>   withTempDir { dir =>
>     val discoveryScript = createTempScriptWithExpectedOutput(
>       dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", "2", "3"]}""")
>
>     val conf = new SparkConf()
>       // Set up a local cluster with a single executor that has 6 CPUs and 4 GPUs.
>       .setMaster("local-cluster[1, 6, 1024]")
>       .setAppName("test-cluster")
>       .set(WORKER_GPU_ID.amountConf, "4")
>       .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
>       .set(EXECUTOR_GPU_ID.amountConf, "4")
>       .set(TASK_GPU_ID.amountConf, "2")
>       // Disable barrier stage retry to fail the application as soon as possible.
>       .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
>     sc = new SparkContext(conf)
>     TestUtils.waitUntilExecutorsUp(sc, 1, 6)
>
>     // Set up a barrier stage with 2 tasks, each requiring 1 CPU and 2 GPUs.
>     // The cluster has 6 CPUs and 4 GPUs in total, so the stage's requirement
>     // (2 CPUs and 4 GPUs) can be satisfied, yet the concurrent-task check fails.
>     assert(sc.parallelize(Range(1, 10), 2)
>       .barrier()
>       .mapPartitions { iter => iter }
>       .collect() sameElements Range(1, 10).toArray[Int])
>   }
> }
> {code}
>
> The error log:
>
> [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
> org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
>   at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
>   at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
>   at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)
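With the configuration above, the expected slot count is simple enough to check by hand. A back-of-the-envelope sketch of the arithmetic the scheduler should perform (not Spark's internal code):

{code:java}
// 1 executor with 6 cores and 4 GPUs; each task needs 1 CPU and 2 GPUs.
val slotsByCpu = 6 / 1                                    // 6
val slotsByGpu = 4 / 2                                    // 2
val maxConcurrentTasks = math.min(slotsByCpu, slotsByGpu) // 2
// 2 slots >= 2 barrier tasks, so the check should pass; the reported bug is
// that Spark computes a smaller number and rejects the stage.
{code}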
[jira] [Created] (SPARK-47459) Cancel running stage if the result is empty relation
Yuming Wang created SPARK-47459:
-----------------------------------

             Summary: Cancel running stage if the result is empty relation
                 Key: SPARK-47459
                 URL: https://issues.apache.org/jira/browse/SPARK-47459
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.1
            Reporter: Yuming Wang
         Attachments: task stack trace.png

How to reproduce:

bin/spark-sql --master yarn --conf spark.driver.host=10.211.174.53

{code:sql}
set spark.sql.adaptive.enabled=true;

select a from (select id as a, id as b, id as z from range(1)) t1
join (select id as c, id as d from range(2)) t2 on t1.a = t2.c
join (select id as e, id as f from range(3)) t3 on t2.d = t3.e
where z % 10 < 0
group by 1;
{code}
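The filter above can never match ({{z}} comes from {{range(1)}}, so {{z % 10}} is never negative), which makes the final result a provably empty relation while the join stages may still be running. A minimal sketch of that observation, assuming an active SparkSession {{spark}} (not code from the ticket):

{code:java}
// The predicate is false for every row, so the aggregate input is empty.
spark.sql("set spark.sql.adaptive.enabled=true")
spark.sql("select id % 10 < 0 as p from range(1)").show()  // p = false
// The proposed improvement: once AQE proves the result is an empty relation,
// Spark could cancel the still-running upstream stages instead of waiting.
{code}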
[jira] [Updated] (SPARK-47459) Cancel running stage if the result is empty relation
[ https://issues.apache.org/jira/browse/SPARK-47459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-47459:
--------------------------------
    Attachment: task stack trace.png

> Cancel running stage if the result is empty relation
> ----------------------------------------------------
>
> Key: SPARK-47459
> URL: https://issues.apache.org/jira/browse/SPARK-47459
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Yuming Wang
> Priority: Major
> Attachments: task stack trace.png
>
> How to reproduce:
> bin/spark-sql --master yarn --conf spark.driver.host=10.211.174.53
> {code:sql}
> set spark.sql.adaptive.enabled=true;
> select a from (select id as a, id as b, id as z from range(1)) t1
> join (select id as c, id as d from range(2)) t2 on t1.a = t2.c
> join (select id as e, id as f from range(3)) t3 on t2.d = t3.e
> where z % 10 < 0
> group by 1;
> {code}
[jira] [Created] (SPARK-47458) Wrong to calculate the concurrent task number
Bobby Wang created SPARK-47458:
----------------------------------

             Summary: Wrong to calculate the concurrent task number
                 Key: SPARK-47458
                 URL: https://issues.apache.org/jira/browse/SPARK-47458
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Bobby Wang

The test case below fails:

{code:java}
test("problem of calculating the maximum concurrent task") {
  withTempDir { dir =>
    val discoveryScript = createTempScriptWithExpectedOutput(
      dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", "2", "3"]}""")

    val conf = new SparkConf()
      // Set up a local cluster with a single executor that has 6 CPUs and 4 GPUs.
      .setMaster("local-cluster[1, 6, 1024]")
      .setAppName("test-cluster")
      .set(WORKER_GPU_ID.amountConf, "4")
      .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
      .set(EXECUTOR_GPU_ID.amountConf, "4")
      .set(TASK_GPU_ID.amountConf, "2")
      // Disable barrier stage retry to fail the application as soon as possible.
      .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
    sc = new SparkContext(conf)
    TestUtils.waitUntilExecutorsUp(sc, 1, 6)

    // Set up a barrier stage with 2 tasks, each requiring 1 CPU and 2 GPUs.
    // The cluster has 6 CPUs and 4 GPUs in total, so the stage's requirement
    // (2 CPUs and 4 GPUs) can be satisfied, yet the concurrent-task check fails.
    assert(sc.parallelize(Range(1, 10), 2)
      .barrier()
      .mapPartitions { iter => iter }
      .collect() sameElements Range(1, 10).toArray[Int])
  }
}
{code}

The error log:

[SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
  at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
  at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
  at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
  at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)
[jira] [Updated] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4+
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47457:
----------------------------------
    Summary: Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4+  (was: Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4)

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4+
> ----------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
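The retitling to "3.4+" suggests the existing check recognizes only 3.3.x-style version strings and misses newer Hadoop releases. A hedged sketch of a forward-compatible check (a hypothetical helper, not the actual Spark patch):

{code:java}
// Shaded Hadoop clients exist from 3.3 onward, so compare the parsed
// (major, minor) pair instead of matching a fixed "3.3" prefix.
def supportsHadoopShadedClient(hadoopVersion: String): Boolean = {
  val nums = hadoopVersion.split("\\.").take(2)
    .map(_.takeWhile(_.isDigit)).filter(_.nonEmpty).map(_.toInt)
  nums.length == 2 && (nums(0) > 3 || (nums(0) == 3 && nums(1) >= 3))
}

// supportsHadoopShadedClient("3.3.6") == true
// supportsHadoopShadedClient("3.4.0") == true   // the case the ticket fixes
// supportsHadoopShadedClient("3.2.4") == false
{code}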
[jira] [Assigned] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47457:
-------------------------------------
    Assignee: Dongjoon Hyun

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
> ---------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47457:
----------------------------------
    Component/s: SQL
                 (was: Spark Core)

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
> ---------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47457:
-----------------------------------
    Labels: pull-request-available  (was: )

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
> ---------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
Dongjoon Hyun created SPARK-47457:
-------------------------------------

             Summary: Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
                 Key: SPARK-47457
                 URL: https://issues.apache.org/jira/browse/SPARK-47457
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-47452) Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
[ https://issues.apache.org/jira/browse/SPARK-47452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47452.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45576
[https://github.com/apache/spark/pull/45576]

> Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
> --------------------------------------------
>
> Key: SPARK-47452
> URL: https://issues.apache.org/jira/browse/SPARK-47452
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47456:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Improvement)

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47456:
-------------------------------------
    Assignee: dzcxzl

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47456.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45584
[https://github.com/apache/spark/pull/45584]

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-47435) SPARK-45561 causes mysql unsigned tinyint overflow
[ https://issues.apache.org/jira/browse/SPARK-47435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-47435:
-----------------------------
    Fix Version/s: 3.5.2

> SPARK-45561 causes mysql unsigned tinyint overflow
> --------------------------------------------------
>
> Key: SPARK-47435
> URL: https://issues.apache.org/jira/browse/SPARK-47435
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
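MySQL's TINYINT UNSIGNED covers 0..255, which does not fit Spark's signed ByteType (-128..127). A hedged sketch of the kind of widening a JDBC dialect mapping needs ({{tinyIntCatalystType}} is a hypothetical helper, not the actual patch):

{code:java}
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

// Widen unsigned TINYINT to ShortType so values 128..255 do not overflow
// a signed byte; signed TINYINT still fits ByteType.
def tinyIntCatalystType(isSigned: Boolean): DataType =
  if (isSigned) ByteType else ShortType
{code}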
[jira] [Resolved] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47453.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45581
[https://github.com/apache/spark/pull/45581]

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47453:
-------------------------------------
    Assignee: Kent Yao

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47422) Support collated strings in array operations
[ https://issues.apache.org/jira/browse/SPARK-47422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-47422.
---------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45563
[https://github.com/apache/spark/pull/45563]

> Support collated strings in array operations
> --------------------------------------------
>
> Key: SPARK-47422
> URL: https://issues.apache.org/jira/browse/SPARK-47422
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Assignee: Nikola Mandic
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Collations need to be properly supported in the following array operations, but they currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, ArrayIntersect, ArrayExcept. Example query:
> {code:java}
> select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate utf8_binary_lcase){code}
> We would expect the result of this query to be true.
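An illustration of the expected semantics from the ticket, runnable from a Scala session (assumes a SparkSession {{spark}} built with collation support):

{code:java}
// Under the case-insensitive utf8_binary_lcase collation, membership tests
// should treat 'aaa' and 'AAA' as equal, so this query should return true.
spark.sql(
  "select array_contains(array('aaa' collate utf8_binary_lcase), " +
  "'AAA' collate utf8_binary_lcase)"
).show()
{code}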
[jira] [Assigned] (SPARK-47422) Support collated strings in array operations
[ https://issues.apache.org/jira/browse/SPARK-47422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-47422:
-----------------------------------
    Assignee: Nikola Mandic

> Support collated strings in array operations
> --------------------------------------------
>
> Key: SPARK-47422
> URL: https://issues.apache.org/jira/browse/SPARK-47422
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Assignee: Nikola Mandic
> Priority: Major
> Labels: pull-request-available
>
> Collations need to be properly supported in the following array operations, but they currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, ArrayIntersect, ArrayExcept. Example query:
> {code:java}
> select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate utf8_binary_lcase){code}
> We would expect the result of this query to be true.
[jira] [Updated] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47456:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-47456) Support ORC Brotli codec
dzcxzl created SPARK-47456:
------------------------------

             Summary: Support ORC Brotli codec
                 Key: SPARK-47456
                 URL: https://issues.apache.org/jira/browse/SPARK-47456
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: dzcxzl
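A hedged usage sketch, assuming the new codec is surfaced through the usual ORC {{compression}} write option once the bundled ORC library supports Brotli (the path is illustrative):

{code:java}
// Write ORC with the Brotli codec; reading back needs no extra option.
spark.range(10).write.option("compression", "brotli").orc("/tmp/orc_brotli_demo")
spark.read.orc("/tmp/orc_brotli_demo").show()
{code}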
[jira] [Updated] (SPARK-45393) Upgrade Hadoop to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-45393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45393:
-----------------------------------
    Labels: pull-request-available  (was: )

> Upgrade Hadoop to 3.4.0
> -----------------------
>
> Key: SPARK-45393
> URL: https://issues.apache.org/jira/browse/SPARK-45393
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47455) Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
[ https://issues.apache.org/jira/browse/SPARK-47455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47455:
-----------------------------------
    Labels: pull-request-available  (was: )

> Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
> -------------------------------------------------------------------------
>
> Key: SPARK-47455
> URL: https://issues.apache.org/jira/browse/SPARK-47455
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.4.2, 4.0.0, 3.5.1
> Reporter: Yang Jie
> Priority: Minor
> Labels: pull-request-available
>
> [https://github.com/apache/spark/blob/e01ed0da22f24204fe23143032ff39be7f4b56af/project/SparkBuild.scala#L157-L173]
>
> {code:java}
> val scalaStyleOnCompileConfig: String = {
>   val in = "scalastyle-config.xml"
>   val out = "scalastyle-on-compile.generated.xml"
>   val replacements = Map(
>     """customId="println" level="error""" -> """customId="println" level="warn"""
>   )
>   var contents = Source.fromFile(in).getLines.mkString("\n")
>   for ((k, v) <- replacements) {
>     require(contents.contains(k), s"Could not rewrite '$k' in original scalastyle config.")
>     contents = contents.replace(k, v)
>   }
>   new PrintWriter(out) {
>     write(contents)
>     close()
>   }
>   out
> }
> {code}
> `Source.fromFile(in)` opens a `BufferedSource` resource handle, but it does not close it.
[jira] [Created] (SPARK-47455) Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
Yang Jie created SPARK-47455:
--------------------------------

             Summary: Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
                 Key: SPARK-47455
                 URL: https://issues.apache.org/jira/browse/SPARK-47455
             Project: Spark
          Issue Type: Bug
          Components: Project Infra
    Affects Versions: 3.5.1, 3.4.2, 4.0.0
            Reporter: Yang Jie

[https://github.com/apache/spark/blob/e01ed0da22f24204fe23143032ff39be7f4b56af/project/SparkBuild.scala#L157-L173]

{code:java}
val scalaStyleOnCompileConfig: String = {
  val in = "scalastyle-config.xml"
  val out = "scalastyle-on-compile.generated.xml"
  val replacements = Map(
    """customId="println" level="error""" -> """customId="println" level="warn"""
  )
  var contents = Source.fromFile(in).getLines.mkString("\n")
  for ((k, v) <- replacements) {
    require(contents.contains(k), s"Could not rewrite '$k' in original scalastyle config.")
    contents = contents.replace(k, v)
  }
  new PrintWriter(out) {
    write(contents)
    close()
  }
  out
}
{code}

`Source.fromFile(in)` opens a `BufferedSource` resource handle, but it does not close it.
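A hedged sketch of one way to close the handle, using {{scala.util.Using}} from Scala 2.13 (illustrative, not necessarily the merged fix; {{in}} is the file name from the quoted code):

{code:java}
import scala.io.Source
import scala.util.Using

val in = "scalastyle-config.xml"

// Read the file through Using.resource so the BufferedSource is always
// closed, even if getLines() or mkString throws.
val contents: String = Using.resource(Source.fromFile(in)) { src =>
  src.getLines().mkString("\n")
}
{code}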
[jira] [Updated] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47453:
-----------------------------------
    Labels: pull-request-available  (was: )

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-47453:
-----------------------------
    Priority: Minor  (was: Major)

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Minor
[jira] [Created] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
Kent Yao created SPARK-47453:
--------------------------------

             Summary: Upgrade MySQL docker image version to 8.3.0
                 Key: SPARK-47453
                 URL: https://issues.apache.org/jira/browse/SPARK-47453
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Docker, SQL, Tests
    Affects Versions: 4.0.0
            Reporter: Kent Yao
[jira] [Resolved] (SPARK-47329) Persist df while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch
[ https://issues.apache.org/jira/browse/SPARK-47329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-47329.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45432
[https://github.com/apache/spark/pull/45432]

> Persist df while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-47329
> URL: https://issues.apache.org/jira/browse/SPARK-47329
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Anish Shrigondekar
> Assignee: Anish Shrigondekar
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Persist df while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch
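The ticket automates a pattern users previously had to apply by hand. A hedged sketch of that manual pattern ({{streamingDf}} and the sink paths are illustrative):

{code:java}
import org.apache.spark.sql.DataFrame

streamingDf.writeStream.foreachBatch { (batchDf: DataFrame, batchId: Long) =>
  // Persist so the two actions below reuse one materialization of the batch
  // instead of re-running the stateful operator (and re-loading its state).
  batchDf.persist()
  batchDf.write.mode("append").parquet("/tmp/sink_a")
  batchDf.write.mode("append").parquet("/tmp/sink_b")
  batchDf.unpersist()
  ()
}.start()
{code}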
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828143#comment-17828143 ]

Ivan Sadikov commented on SPARK-46990:
--------------------------------------

Opened PR https://github.com/apache/spark/pull/45578.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Labels: pull-request-available
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain 2734305632140666820. Created 2734305632140666820 with default configuration (schedulingMode: FIFO,
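A hedged Scala restatement of the expected behavior (the path is illustrative): a zero-record Avro container file should load as an empty DataFrame rather than hang.

{code:java}
// Expected: count() returns 0 promptly; the reported regression hangs here.
val df = spark.read.format("avro").load("/tmp/empty_eventhub_capture.avro")
assert(df.count() == 0L)
{code}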
[jira] [Updated] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46990:
-----------------------------------
    Labels: pull-request-available  (was: )

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Labels: pull-request-available
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain 2734305632140666820. Created 2734305632140666820 with default configuration (schedulingMode: FIFO, minShare: 0, weight: 1)
> 24/02/06 10:03:12 INFO FairSchedulabl
[jira] [Updated] (SPARK-47452) Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
[ https://issues.apache.org/jira/browse/SPARK-47452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47452:
----------------------------------
    Summary: Use `Ubuntu 22.04` in `dev/infra/Dockerfile`  (was: Use Ubuntu 22.04 in `dev/infra/Dockerfile`)

> Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
> --------------------------------------------
>
> Key: SPARK-47452
> URL: https://issues.apache.org/jira/browse/SPARK-47452
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
[jira] [Updated] (SPARK-47452) Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
[ https://issues.apache.org/jira/browse/SPARK-47452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47452:
-----------------------------------
    Labels: pull-request-available  (was: )

> Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
> --------------------------------------------
>
> Key: SPARK-47452
> URL: https://issues.apache.org/jira/browse/SPARK-47452
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47452) Use Ubuntu 22.04 in `dev/infra/Dockerfile`
Dongjoon Hyun created SPARK-47452:
-------------------------------------

             Summary: Use Ubuntu 22.04 in `dev/infra/Dockerfile`
                 Key: SPARK-47452
                 URL: https://issues.apache.org/jira/browse/SPARK-47452
             Project: Spark
          Issue Type: Sub-task
          Components: Project Infra
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47450.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45574
[https://github.com/apache/spark/pull/45574]

> Use R 4.3.3 in `windows` R GitHub Action job
> --------------------------------------------
>
> Key: SPARK-47450
> URL: https://issues.apache.org/jira/browse/SPARK-47450
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
[ https://issues.apache.org/jira/browse/SPARK-47448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47448.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45572
[https://github.com/apache/spark/pull/45572]

> Enable spark.shuffle.service.removeShuffle by default
> -----------------------------------------------------
>
> Key: SPARK-47448
> URL: https://issues.apache.org/jira/browse/SPARK-47448
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
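With the default flipped to true, deployments that relied on shuffle files outliving released executors can opt back out. A hedged sketch (the configuration key is from the ticket; the rest is illustrative):

{code:java}
import org.apache.spark.SparkConf

// Restore the previous behavior of keeping shuffle data after executor release.
val conf = new SparkConf().set("spark.shuffle.service.removeShuffle", "false")
{code}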
[jira] [Updated] (SPARK-47451) Support to_json(variant)
[ https://issues.apache.org/jira/browse/SPARK-47451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47451:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support to_json(variant)
> ------------------------
>
> Key: SPARK-47451
> URL: https://issues.apache.org/jira/browse/SPARK-47451
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Chenhao Li
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47451) Support to_json(variant)
Chenhao Li created SPARK-47451:
----------------------------------

             Summary: Support to_json(variant)
                 Key: SPARK-47451
                 URL: https://issues.apache.org/jira/browse/SPARK-47451
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Chenhao Li
[jira] [Updated] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47450:
-----------------------------------
    Labels: pull-request-available  (was: )

> Use R 4.3.3 in `windows` R GitHub Action job
> --------------------------------------------
>
> Key: SPARK-47450
> URL: https://issues.apache.org/jira/browse/SPARK-47450
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-45921) Use Hadoop 3.3.5 winutils in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-45921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45921:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Improvement)

> Use Hadoop 3.3.5 winutils in AppVeyor build
> -------------------------------------------
>
> Key: SPARK-45921
> URL: https://issues.apache.org/jira/browse/SPARK-45921
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-45995) Upgrade R version from 4.3.1 to 4.3.2 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-45995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45995:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Improvement)

> Upgrade R version from 4.3.1 to 4.3.2 in AppVeyor
> -------------------------------------------------
>
> Key: SPARK-45995
> URL: https://issues.apache.org/jira/browse/SPARK-45995
> Project: Spark
> Issue Type: Sub-task
> Components: R
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://cran.r-project.org/doc/manuals/r-release/NEWS.html
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828113#comment-17828113 ]

Pavlo Pohrrebnyi commented on SPARK-46990:
------------------------------------------

[~ivan.sadikov], feel free to use it. That is a standard Event Hub capture file, with the Azure-defined schema and no data inside.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain 2734305632140666820. Created 2734305632140666820 with default
[jira] [Created] (SPARK-47449) Refactor and split list/timer unit tests
Jing Zhan created SPARK-47449:
---------------------------------

             Summary: Refactor and split list/timer unit tests
                 Key: SPARK-47449
                 URL: https://issues.apache.org/jira/browse/SPARK-47449
             Project: Spark
          Issue Type: Task
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Jing Zhan

Refactor ListState and timer related unit tests. As planned in the test plan for state-v2, list/timer should be tested in both integration and unit tests. Currently, timer-related tests could be refactored to use the base suite class in {{ValueStateSuite}}, and list state unit tests are needed in addition to {{TransformWithListStateSuite}}.
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828107#comment-17828107 ]

Ivan Sadikov commented on SPARK-46990:
--------------------------------------

Thanks, Kamil. I am still debugging and will try to open a PR with the fix today or tomorrow.

[~pashashiz] Is it okay to use the provided sample file in the PR for a unit test? I will also reach out to Databricks to fix it there.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or whe
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828106#comment-17828106 ]

Kamil Kandzia commented on SPARK-46990:
---------------------------------------

It is likely that the cause appeared before your fixes were created. I observed this on Databricks 14.0, which was released in September 2023. According to [Databricks Runtime 14.0 | Databricks on AWS|https://docs.databricks.com/en/release-notes/runtime/14.0.html], it doesn't contain your changes from SPARK-46633.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, w
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828100#comment-17828100 ] Ivan Sadikov commented on SPARK-46990: -- Yes, sure. Thanks for reporting. I will take a look and open a PR once I have root caused the issue. > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. 
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input > to shuffle 11 > 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 > output partitions > 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) > 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List() > 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List() > 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 > (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at > LexicalThreadLocal.scala:63), which has no missing parents > 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 31 (MapPartitionsRDD[104] at > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 > tasks are for partitions Vector(0)) > 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks > resource profile 0 > 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1 > 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with > scheduler pool 2734305632140666820, which has not been configured. This can > happen when the file that pools are read from isn't set, or when that file > doesn't contain 2734305632140666820. Created 2734305632140666820 with default > configuration (schedulingMode: FI
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828095#comment-17828095 ] Kamil Kandzia commented on SPARK-46990: --- Could you [~ivan.sadikov] look into this issue? Pavlo has attached an example avro file (I forgot to do this). > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. 
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input > to shuffle 11 > 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 > output partitions > 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) > 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List() > 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List() > 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 > (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at > LexicalThreadLocal.scala:63), which has no missing parents > 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 31 (MapPartitionsRDD[104] at > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 > tasks are for partitions Vector(0)) > 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks > resource profile 0 > 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1 > 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with > scheduler pool 2734305632140666820, which has not been configured. This can > happen when the file that pools are read from isn't set, or when that file > doesn't contain 2734305632140666820. Created 2734305632140666820 with default > configuration (schedulingMode:
[jira] [Assigned] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
[ https://issues.apache.org/jira/browse/SPARK-47448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47448: - Assignee: Dongjoon Hyun > Enable spark.shuffle.service.removeShuffle by default > - > > Key: SPARK-47448 > URL: https://issues.apache.org/jira/browse/SPARK-47448 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45375) [CORE] Mark connection as timedOut in TransportClient.close
[ https://issues.apache.org/jira/browse/SPARK-45375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-45375: --- Assignee: Hasnain Lakhani > [CORE] Mark connection as timedOut in TransportClient.close > --- > > Key: SPARK-45375 > URL: https://issues.apache.org/jira/browse/SPARK-45375 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > > Avoid a race condition where a connection which is in the process of being > closed could be returned by the TransportClientFactory only to be immediately > closed and cause errors upon use > > This doesn't happen much in practice but is observed more frequently as part > of efforts to add SSL support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
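The race described above follows a common pattern: flag a pooled connection as unusable before tearing it down, so a concurrent factory lookup cannot hand out a half-closed client. A minimal sketch of that pattern (simplified, hypothetical names -- not the actual TransportClient/TransportClientFactory code):
{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Sketch of "mark as timed out before closing" (illustrative only).
class PooledClient {
  private val timedOut = new AtomicBoolean(false)

  // A pool/factory only reuses clients that report themselves active.
  def isActive: Boolean = !timedOut.get

  def close(): Unit = {
    // Flip the flag first: any concurrent lookup now sees an inactive client
    // and creates a fresh one instead of returning this half-closed instance.
    timedOut.set(true)
    // ... release the underlying channel here ...
  }
}
{code}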
[jira] [Comment Edited] (SPARK-45374) [CORE] Add test keys for SSL functionality
[ https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828087#comment-17828087 ] Mridul Muralidharan edited comment on SPARK-45374 at 3/18/24 8:28 PM: -- Missed your query, you can link by: "More" -> Link -> Web Link -> * URL == PR url * Link text == "GitHub Pull Request #" I did it for this PR, please let me know if you are unable to do it for the others was (Author: mridulm80): Missed your query, you can link by: "more' -> link -> Web Link -> * URL == pr url * Link test == "GitHub Pull Request #" I did it for this PR > [CORE] Add test keys for SSL functionality > -- > > Key: SPARK-45374 > URL: https://issues.apache.org/jira/browse/SPARK-45374 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > > Add test SSL keys which will be used for unit and integration tests of the > new SSL RPC functionality -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45374) [CORE] Add test keys for SSL functionality
[ https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-45374: --- Assignee: Hasnain Lakhani > [CORE] Add test keys for SSL functionality > -- > > Key: SPARK-45374 > URL: https://issues.apache.org/jira/browse/SPARK-45374 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > > Add test SSL keys which will be used for unit and integration tests of the > new SSL RPC functionality -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
[ https://issues.apache.org/jira/browse/SPARK-47448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47448: --- Labels: pull-request-available (was: ) > Enable spark.shuffle.service.removeShuffle by default > - > > Key: SPARK-47448 > URL: https://issues.apache.org/jira/browse/SPARK-47448 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
Dongjoon Hyun created SPARK-47448: - Summary: Enable spark.shuffle.service.removeShuffle by default Key: SPARK-47448 URL: https://issues.apache.org/jira/browse/SPARK-47448 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
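The flag in question already exists as an opt-in; this ticket only flips the default. Until 4.0.0, the behavior can be enabled explicitly, for example (sketch; assumes the external shuffle service is in use):
{code:scala}
import org.apache.spark.SparkConf

// Opt in on releases where the default is still false. When enabled, the
// external shuffle service may delete shuffle blocks that are no longer
// needed, which is useful with dynamic allocation.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.removeShuffle", "true")
{code}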
[jira] [Comment Edited] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828090#comment-17828090 ] Pavlo Pohrrebnyi edited comment on SPARK-46990 at 3/18/24 8:22 PM: --- We are experiencing the same with Spark 3.5. That is likely caused by SPARK-46633. Here is the change: [PR-44635|https://github.com/apache/spark/pull/44635/files#diff-c139f61eabcfcb9725c8caeb747becae061a2ea44f774b12c9cce5aeac102880]. The job hangs once it tries to read avro files with no records. It loops forever here: {code:java} org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow(AvroUtils.scala:265) org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow$(AvroUtils.scala:263) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNextRow(AvroFileFormat.scala:186) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNext(AvroFileFormat.scala:201) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:604) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:798) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:506) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$Lambda$1269/1728924771.apply$mcZ$sp(Unknown Source) {code} Here is the sample to reproduce the issue: [^second=02.avro] was (Author: pashashiz): We are experiencing the same with Spark 3.5. That is likely caused by SPARK-46633. Here is the change [PR-44635|https://github.com/apache/spark/pull/44635/files#diff-c139f61eabcfcb9725c8caeb747becae061a2ea44f774b12c9cce5aeac102880]] The job hangs once tries to read avro files with no records. It loops forever here: {code:java} org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow(AvroUtils.scala:265) org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow$(AvroUtils.scala:263) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNextRow(AvroFileFormat.scala:186) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNext(AvroFileFormat.scala:201) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:604) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:798) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:506) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$Lambda$1269/1728924771.apply$mcZ$sp(Unknown Source) {code} Here is the sample to reproduce the issue: [^second=02.avro] > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). 
I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/
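The stack trace in the comment above shows the read stuck inside AvroUtils$RowReader.hasNextRow. A minimal sketch of how such a loop can spin forever on a file with zero records (simplified names, illustrative only -- not the actual Spark code):
{code:scala}
// A hasNext that waits for a row but has no exit path for an exhausted
// source never terminates on an empty file.
class RowReader(records: Iterator[String]) {
  private var nextRow: Option[String] = None

  def hasNextRow: Boolean = {
    while (nextRow.isEmpty) {
      if (records.hasNext) {
        nextRow = Some(records.next())
      }
      // BUG: no "else return false" branch -- when `records` is empty the
      // condition never changes and the while loop never exits.
    }
    true
  }
}

// new RowReader(Iterator.empty).hasNextRow  // never returns
{code}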
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828090#comment-17828090 ] Pavlo Pohrrebnyi commented on SPARK-46990: -- We are experiencing the same with Spark 3.5. That is likely caused by [SPARK-46633|https://issues.apache.org/jira/browse/SPARK-46633]. Here is the change: [PR-44635|https://github.com/apache/spark/pull/44635/files#diff-c139f61eabcfcb9725c8caeb747becae061a2ea44f774b12c9cce5aeac102880]. The job hangs once it tries to read avro files with no records. It loops forever here: {code:java} org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow(AvroUtils.scala:265) org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow$(AvroUtils.scala:263) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNextRow(AvroFileFormat.scala:186) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNext(AvroFileFormat.scala:201) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:604) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:798) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:506) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$Lambda$1269/1728924771.apply$mcZ$sp(Unknown Source) {code} Here is the sample to reproduce the issue: [^second=02.avro] > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. 
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. > 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104
[jira] [Updated] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavlo Pohrrebnyi updated SPARK-46990: - Attachment: second=02.avro > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. 
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input > to shuffle 11 > 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 > output partitions > 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) > 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List() > 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List() > 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 > (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at > LexicalThreadLocal.scala:63), which has no missing parents > 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 31 (MapPartitionsRDD[104] at > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 > tasks are for partitions Vector(0)) > 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks > resource profile 0 > 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1 > 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with > scheduler pool 2734305632140666820, which has not been configured. This can > happen when the file that pools are read from isn't set, or when that file > doesn't contain 2734305632140666820. Created 2734305632140666820 with default > configuration (schedulingMode: FIFO, minShare: 0, weight: 1) > 24/02/06 10:03:12 INFO FairSchedulableBuilder: Added task set TaskSet_31.0 > tasks to pool
[jira] [Commented] (SPARK-45374) [CORE] Add test keys for SSL functionality
[ https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828087#comment-17828087 ] Mridul Muralidharan commented on SPARK-45374: - Missed your query, you can link by: "More" -> Link -> Web Link -> * URL == PR url * Link text == "GitHub Pull Request #" > [CORE] Add test keys for SSL functionality > -- > > Key: SPARK-45374 > URL: https://issues.apache.org/jira/browse/SPARK-45374 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Priority: Major > > Add test SSL keys which will be used for unit and integration tests of the > new SSL RPC functionality -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
[ https://issues.apache.org/jira/browse/SPARK-47446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47446. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45570 [https://github.com/apache/spark/pull/45570] > Make `BlockManager` warn before `removeBlockInternal` > - > > Key: SPARK-47446 > URL: https://issues.apache.org/jira/browse/SPARK-47446 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to > exception java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. > 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed > normally. > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage > 0: Stage cancelled > 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at > SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: > Task serialization failed: java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
[ https://issues.apache.org/jira/browse/SPARK-47446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47446: - Assignee: Dongjoon Hyun > Make `BlockManager` warn before `removeBlockInternal` > - > > Key: SPARK-47446 > URL: https://issues.apache.org/jira/browse/SPARK-47446 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > {code} > 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to > exception java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. > 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed > normally. > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage > 0: Stage cancelled > 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at > SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: > Task serialization failed: java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47447) Allow reading Parquet TimestampLTZ as TimestampNTZ
[ https://issues.apache.org/jira/browse/SPARK-47447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47447: --- Labels: pull-request-available (was: ) > Allow reading Parquet TimestampLTZ as TimestampNTZ > -- > > Key: SPARK-47447 > URL: https://issues.apache.org/jira/browse/SPARK-47447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: pull-request-available > > Currently, Parquet TimestampNTZ type columns can be read as TimestampLTZ, > while reading TimestampLTZ as TimestampNTZ will cause errors. This makes it > impossible to read parquet files containing both TimestampLTZ and > TimestampNTZ as TimestampNTZ. > To make the data type system on Parquet simpler, we should allow reading > TimestampLTZ as TimestampNTZ in the Parquet data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47447) Allow reading Parquet TimestampLTZ as TimestampNTZ
Gengliang Wang created SPARK-47447: -- Summary: Allow reading Parquet TimestampLTZ as TimestampNTZ Key: SPARK-47447 URL: https://issues.apache.org/jira/browse/SPARK-47447 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Currently, Parquet TimestampNTZ type columns can be read as TimestampLTZ, while reading TimestampLTZ as TimestampNTZ will cause errors. This makes it impossible to read parquet files containing both TimestampLTZ and TimestampNTZ as TimestampNTZ. To make the data type system on Parquet simpler, we should allow reading TimestampLTZ as TimestampNTZ in the Parquet data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
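Per the ticket description, after the change a user-supplied TimestampNTZ schema should be honored even when the underlying Parquet column was written as TimestampLTZ. A sketch of the intended usage (behavior as described above, targeting 4.0.0; assumes a SparkSession named spark, and the path is a placeholder):
{code:scala}
import org.apache.spark.sql.types.{StructField, StructType, TimestampNTZType}

// Read back both LTZ- and NTZ-written timestamp columns as TimestampNTZ.
val schema = StructType(Seq(StructField("ts", TimestampNTZType)))
val df = spark.read.schema(schema).parquet("/path/to/mixed_timestamp_files")
{code}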
[jira] [Updated] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
[ https://issues.apache.org/jira/browse/SPARK-47446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47446: --- Labels: pull-request-available (was: ) > Make `BlockManager` warn before `removeBlockInternal` > - > > Key: SPARK-47446 > URL: https://issues.apache.org/jira/browse/SPARK-47446 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > {code} > 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to > exception java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. > 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed > normally. > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage > 0: Stage cancelled > 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at > SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: > Task serialization failed: java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
Dongjoon Hyun created SPARK-47446: - Summary: Make `BlockManager` warn before `removeBlockInternal` Key: SPARK-47446 URL: https://issues.apache.org/jira/browse/SPARK-47446 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun {code} 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to exception java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed normally. 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: Task serialization failed: java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
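The log excerpt shows why the ordering matters: if the internal removal itself throws (here a NoSuchFileException), a warning emitted beforehand still records which block and which failure led to the cleanup. A generic sketch of that warn-before-cleanup ordering (hypothetical helper names, not the actual BlockManager API; assumes slf4j on the classpath):
{code:scala}
import org.slf4j.LoggerFactory

object BlockCleanup {
  private val log = LoggerFactory.getLogger(getClass)

  // Warn first, then attempt the removal that may itself fail: if
  // removeInternal throws, the log already explains the context.
  def abortPut(blockId: String, cause: Throwable)(removeInternal: String => Unit): Unit = {
    log.warn(s"Putting block $blockId failed due to exception $cause.")
    removeInternal(blockId)
  }
}
{code}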
[jira] [Updated] (SPARK-47383) Support `spark.shutdown.timeout` config
[ https://issues.apache.org/jira/browse/SPARK-47383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47383: -- Summary: Support `spark.shutdown.timeout` config (was: Make the shutdown hook timeout configurable) > Support `spark.shutdown.timeout` config > --- > > Key: SPARK-47383 > URL: https://issues.apache.org/jira/browse/SPARK-47383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > org.apache.spark.util.ShutdownHookManager is used to register custom shutdown > operations. This is not easily configurable. The underlying > org.apache.hadoop.util.ShutdownHookManager has a default timeout of 30 > seconds. It can be configured by setting hadoop.service.shutdown.timeout, > but this must be done in the core-site.xml/core-default.xml because a new > hadoop conf object is created and there is no opportunity to modify it. > org.apache.hadoop.util.ShutdownHookManager provides an overload to pass a > custom timeout. Spark should use that and allow a user defined timeout to be > used. > This is useful because we see timeouts during shutdown and want to give some > extra time for the event queues to drain to avoid log data loss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
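With the rename above, the new knob is a plain Spark config, avoiding the core-site.xml workaround described in the issue. A sketch of the intended usage (config name per this ticket's summary, available from 4.0.0):
{code:scala}
import org.apache.spark.SparkConf

// Give shutdown hooks more headroom than Hadoop's 30-second default,
// e.g. so event queues can drain before the JVM exits.
val conf = new SparkConf()
  .set("spark.shutdown.timeout", "60s")
{code}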
[jira] [Resolved] (SPARK-47383) Make the shutdown hook timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-47383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47383. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45504 [https://github.com/apache/spark/pull/45504] > Make the shutdown hook timeout configurable > --- > > Key: SPARK-47383 > URL: https://issues.apache.org/jira/browse/SPARK-47383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > org.apache.spark.util.ShutdownHookManager is used to register custom shutdown > operations. This is not easily configurable. The underlying > org.apache.hadoop.util.ShutdownHookManager has a default timeout of 30 > seconds. It can be configured by setting hadoop.service.shutdown.timeout, > but this must be done in the core-site.xml/core-default.xml because a new > hadoop conf object is created and there is no opportunity to modify it. > org.apache.hadoop.util.ShutdownHookManager provides an overload to pass a > custom timeout. Spark should use that and allow a user defined timeout to be > used. > This is useful because we see timeouts during shutdown and want to give some > extra time for the event queues to drain to avoid log data loss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47383) Make the shutdown hook timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-47383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47383: - Assignee: Rob Reeves > Make the shutdown hook timeout configurable > --- > > Key: SPARK-47383 > URL: https://issues.apache.org/jira/browse/SPARK-47383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Minor > Labels: pull-request-available > > org.apache.spark.util.ShutdownHookManager is used to register custom shutdown > operations. This is not easily configurable. The underlying > org.apache.hadoop.util.ShutdownHookManager has a default timeout of 30 > seconds. It can be configured by setting hadoop.service.shutdown.timeout, > but this must be done in the core-site.xml/core-default.xml because a new > hadoop conf object is created and there is no opportunity to modify it. > org.apache.hadoop.util.ShutdownHookManager provides an overload to pass a > custom timeout. Spark should use that and allow a user defined timeout to be > used. > This is useful because we see timeouts during shutdown and want to give some > extra time for the event queues to drain to avoid log data loss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47445) Adding new 'Silent' ExplainMode
Victor Sunderland created SPARK-47445: - Summary: Adding new 'Silent' ExplainMode Key: SPARK-47445 URL: https://issues.apache.org/jira/browse/SPARK-47445 Project: Spark Issue Type: Improvement Components: Connect, Documentation, PySpark, SQL Affects Versions: 4.0.0 Reporter: Victor Sunderland While investigating unit test duration we found that org.apache.spark.sql.execution.QueryExecution.explainString() takes approximately 14% of the time. This method generates the string representation of the execution plan. The string is often used for logging purposes. This is also called for each AQE job, so it can save prod execution time too. While SPARK-44485 does exist to help optimize the prod execution time, the main purpose of this PR is to save time during unit testing. I've added a silent mode to ExplainMode to try to mitigate this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
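For context, the explain modes that exist in released Spark are shown below; "silent" is the mode this ticket proposes (name per the ticket, not yet a released API). Assumes a SparkSession named spark, e.g. in spark-shell:
{code:scala}
val df = spark.range(10).selectExpr("id * 2 AS doubled")

df.explain("simple")     // existing: physical plan only
df.explain("extended")   // existing: logical and physical plans
df.explain("formatted")  // existing: operator list plus a details section
// Proposed: a "silent" mode under which explainString() produces no output,
// skipping plan stringification in hot paths such as per-AQE-job logging.
{code}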
[jira] [Updated] (SPARK-47444) Empty numRows table stats should not break Hive tables
[ https://issues.apache.org/jira/browse/SPARK-47444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miklos Szurap updated SPARK-47444: -- Attachment: reproduction_steps_SPARK-47444.txt > Empty numRows table stats should not break Hive tables > -- > > Key: SPARK-47444 > URL: https://issues.apache.org/jira/browse/SPARK-47444 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8 >Reporter: Miklos Szurap >Priority: Major > Labels: Hive, HiveMetaStoreClient, SQL > Attachments: reproduction_steps_SPARK-47444.txt > > > A Hive table cannot be accessed / queried / updated from Spark (it is > completely "broken") if the "numRows" table property (table stat) is > populated with a non-numeric value (like an empty string). Accessing the table > from Spark results in a "NumberFormatException": > {code} > scala> spark.sql("select * from t1p").show() > java.lang.NumberFormatException: Zero length BigInteger > at java.math.BigInteger.(BigInteger.java:420) > ... > at > org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243) > ... > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91) > ... > {code} > or > similarly just with > {code} > java.lang.NumberFormatException: For input string: "Foo" > {code} > Currently the table stats can be broken through Spark with > {code} > scala> spark.sql("alter table t1p set tblproperties('numRows'='', > 'STATS_GENERATED_VIA_STATS_TASK'='true')").show() > {code} > > Spark should: > 1. Validate sparkSQL "alter table" statements and not allow non-numeric > values in the "totalSize", "numRows", "rawDataSize" table properties, as > those are checked in the > [HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28] > 2. The HiveClientImpl#readHiveStats should probably tolerate these wrong > "totalSize", "numRows", "rawDataSize" table properties and not fail with a > cryptic NumberFormatException, but treat those as zero. Or at least it should > provide a clue in the error message about which table property is incorrect. > Note: beeline/Hive validates alter table statements, however Impala can > similarly break the table, so the above item #1 needs to be fixed there too. > I have checked only the Spark 2.4.x behavior, the same probably exists in > Spark 3.x too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47444) Empty numRows table stats should not break Hive tables
Miklos Szurap created SPARK-47444: - Summary: Empty numRows table stats should not break Hive tables Key: SPARK-47444 URL: https://issues.apache.org/jira/browse/SPARK-47444 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.8 Reporter: Miklos Szurap A Hive table cannot be accessed / queried / updated from Spark (it is completely "broken") if the "numRows" table property (table stat) is populated with a non-numeric value (like an empty string). Accessing the table from Spark results in a "NumberFormatException": {code} scala> spark.sql("select * from t1p").show() java.lang.NumberFormatException: Zero length BigInteger at java.math.BigInteger.<init>(BigInteger.java:420) ... at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243) ... at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91) ... {code} or similarly just with {code} java.lang.NumberFormatException: For input string: "Foo" {code} Currently the table stats can be broken through Spark with {code} scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show() {code} Spark should: 1. Validate Spark SQL "alter table" statements and not allow non-numeric values in the "totalSize", "numRows", "rawDataSize" table properties, as those are checked in [HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28] 2. HiveClientImpl#readHiveStats should probably tolerate these wrong "totalSize", "numRows", "rawDataSize" table properties and not fail with a cryptic NumberFormatException, but treat those as zero. Or at least it should provide a clue in the error message about which table property is incorrect. Note: beeline/Hive validates alter table statements; however, Impala can similarly break the table, so the above item #1 needs to be fixed there too. I have checked only the Spark 2.4.x behavior; the same problem probably exists in Spark 3.x too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
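A minimal sketch of the tolerant parsing suggested in item #2 above (an illustration under stated assumptions, not the actual Spark patch; the helper name is hypothetical):

{code:java}
import scala.util.Try

// Sketch only: read a Hive stat table property, treating non-numeric values
// (such as the empty string) as missing instead of throwing.
def readStatSafe(props: Map[String, String], key: String): Option[BigInt] =
  props.get(key).flatMap(v => Try(BigInt(v)).toOption)

// readStatSafe(Map("numRows" -> ""), "numRows")   // None, no exception
// readStatSafe(Map("numRows" -> "42"), "numRows") // Some(42)
{code}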
[jira] [Resolved] (SPARK-47442) Use port 0 to start worker server in MasterSuite
[ https://issues.apache.org/jira/browse/SPARK-47442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47442. --- Fix Version/s: 4.0.0 Assignee: wuyi Resolution: Fixed > Use port 0 to start worker server in MasterSuite > > > Key: SPARK-47442 > URL: https://issues.apache.org/jira/browse/SPARK-47442 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47435) SPARK-45561 causes mysql unsigned tinyint overflow
[ https://issues.apache.org/jira/browse/SPARK-47435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47435. --- Fix Version/s: 4.0.0 Assignee: Kent Yao Resolution: Fixed > SPARK-45561 causes mysql unsigned tinyint overflow > -- > > Key: SPARK-47435 > URL: https://issues.apache.org/jira/browse/SPARK-47435 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47438) Upgrade jackson to 2.17.0
[ https://issues.apache.org/jira/browse/SPARK-47438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47438. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45562 [https://github.com/apache/spark/pull/45562] > Upgrade jackson to 2.17.0 > - > > Key: SPARK-47438 > URL: https://issues.apache.org/jira/browse/SPARK-47438 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47443) Window aggregate support
[ https://issues.apache.org/jira/browse/SPARK-47443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47443: --- Labels: pull-request-available (was: ) > Window aggregate support > > > Key: SPARK-47443 > URL: https://issues.apache.org/jira/browse/SPARK-47443 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47443) Window aggregate support
Aleksandar Tomic created SPARK-47443: Summary: Window aggregate support Key: SPARK-47443 URL: https://issues.apache.org/jira/browse/SPARK-47443 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 4.0.0 Reporter: Aleksandar Tomic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47442) Use port 0 to start worker server in MasterSuite
[ https://issues.apache.org/jira/browse/SPARK-47442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47442: -- Reporter: wuyi (was: Dongjoon Hyun) > Use port 0 to start worker server in MasterSuite > > > Key: SPARK-47442 > URL: https://issues.apache.org/jira/browse/SPARK-47442 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: wuyi >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47434) Streaming Statistics link redirect causing 302 error
[ https://issues.apache.org/jira/browse/SPARK-47434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47434: - Assignee: Huw > Streaming Statistics link redirect causing 302 error > > > Key: SPARK-47434 > URL: https://issues.apache.org/jira/browse/SPARK-47434 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.4.1, 3.5.1 >Reporter: Huw >Assignee: Huw >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.2 > > > When using a reverse proxy, links to the streaming statistics page are missing a > trailing slash, which causes a redirect (302) to an incorrect path. > Essentially the same issue as > https://issues.apache.org/jira/browse/SPARK-24553 but for a different link. > .../StreamingQuery/statistics?id=abcd -> > .../StreamingQuery/statistics/?id=abcd > Linked PR: [https://github.com/apache/spark/pull/45527/files] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47434) Streaming Statistics link redirect causing 302 error
[ https://issues.apache.org/jira/browse/SPARK-47434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47434. --- Fix Version/s: 3.4.3 4.0.0 Resolution: Fixed Issue resolved by pull request 45527 [https://github.com/apache/spark/pull/45527] > Streaming Statistics link redirect causing 302 error > > > Key: SPARK-47434 > URL: https://issues.apache.org/jira/browse/SPARK-47434 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.4.1, 3.5.1 >Reporter: Huw >Assignee: Huw >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.3, 3.5.2, 4.0.0 > > > When using a reverse proxy, links to the streaming statistics page are missing a > trailing slash, which causes a redirect (302) to an incorrect path. > Essentially the same issue as > https://issues.apache.org/jira/browse/SPARK-24553 but for a different link. > .../StreamingQuery/statistics?id=abcd -> > .../StreamingQuery/statistics/?id=abcd > Linked PR: [https://github.com/apache/spark/pull/45527/files] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
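Illustrative sketch only (the real change is in the linked PR): the fix amounts to emitting the trailing slash before the query string, so the reverse proxy resolves the path directly instead of answering with a 302:

{code:java}
// Hypothetical helper for illustration; not the actual Spark UI code.
// ".../statistics?id=x" 302-redirects behind a reverse proxy, while
// ".../statistics/?id=x" resolves directly.
def statisticsLink(uiRoot: String, queryId: String): String =
  s"$uiRoot/StreamingQuery/statistics/?id=$queryId"
{code}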
[jira] [Updated] (SPARK-47442) Use port 0 to start worker server in MasterSuite
[ https://issues.apache.org/jira/browse/SPARK-47442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47442: --- Labels: pull-request-available (was: ) > Use port 0 to start worker server in MasterSuite > > > Key: SPARK-47442 > URL: https://issues.apache.org/jira/browse/SPARK-47442 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47442) Use port 0 to start worker server in MasterSuite
Dongjoon Hyun created SPARK-47442: - Summary: Use port 0 to start worker server in MasterSuite Key: SPARK-47442 URL: https://issues.apache.org/jira/browse/SPARK-47442 Project: Spark Issue Type: Test Components: Spark Core, Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
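As background for the test change above: binding to port 0 asks the OS for any free ephemeral port, which avoids hard-coded-port conflicts in suites like MasterSuite. A minimal, self-contained illustration using plain Java sockets (not the Spark code itself):

{code:java}
import java.net.ServerSocket

// Port 0 lets the OS pick a free ephemeral port; the actual port is then
// queried back, so parallel test runs never collide on a fixed port.
val socket = new ServerSocket(0)
val boundPort = socket.getLocalPort // e.g. 54321, chosen by the OS
socket.close()
{code}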
[jira] [Updated] (SPARK-47438) Upgrade jackson to 2.17.0
[ https://issues.apache.org/jira/browse/SPARK-47438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47438: -- Parent: SPARK-47046 Issue Type: Sub-task (was: Improvement) > Upgrade jackson to 2.17.0 > - > > Key: SPARK-47438 > URL: https://issues.apache.org/jira/browse/SPARK-47438 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47441) Do not add log link for unmanaged AM in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-47441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47441: --- Labels: pull-request-available (was: ) > Do not add log link for unmanaged AM in Spark UI > > > Key: SPARK-47441 > URL: https://issues.apache.org/jira/browse/SPARK-47441 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.5.0, 3.5.1 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > > {noformat} > 24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] > scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception > java.lang.NumberFormatException: For input string: "null" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) > ~[?:?] > at java.lang.Integer.parseInt(Integer.java:668) ~[?:?] > at java.lang.Integer.parseInt(Integer.java:786) ~[?:?] > at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) > ~[scala-library-2.12.18.jar:?] > at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) > ~[scala-library-2.12.18.jar:?] > at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) > ~[scala-library-2.12.18.jar:?] > at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > ~[scala-library-2.12.18.jar:?] > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > ~[scala-library-2.12.18.jar:?] 
> at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) > [spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > [spark-core_2.12-3.5.1.jar:3.5.1] > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47441) Do not add log link for unmanaged AM in Spark UI
Yuming Wang created SPARK-47441: --- Summary: Do not add log link for unmanaged AM in Spark UI Key: SPARK-47441 URL: https://issues.apache.org/jira/browse/SPARK-47441 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.5.1, 3.5.0 Reporter: Yuming Wang {noformat} 24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception java.lang.NumberFormatException: For input string: "null" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) ~[?:?] at java.lang.Integer.parseInt(Integer.java:668) ~[?:?] at java.lang.Integer.parseInt(Integer.java:786) ~[?:?] at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) ~[scala-library-2.12.18.jar:?] at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.18.jar:?] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) ~[scala-library-2.12.18.jar:?] 
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) [spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) [spark-core_2.12-3.5.1.jar:3.5.1] {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
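The trace above shows Utils.parseHostPort throwing because the unmanaged AM reports the literal string "null" as its port. A defensive-parse sketch for illustration (an assumption only; the actual fix, per the summary, is to skip adding the log link rather than to change the parser):

{code:java}
import scala.util.Try

// Sketch only: tolerate a malformed "host:port" such as "somehost:null"
// instead of letting NumberFormatException escape into the listener bus.
def parseHostPortSafe(hostPort: String): Option[(String, Int)] = {
  val idx = hostPort.lastIndexOf(':')
  if (idx < 0) None
  else Try(hostPort.substring(idx + 1).toInt).toOption
    .map(port => (hostPort.substring(0, idx), port))
}

// parseHostPortSafe("somehost:null") // None, no exception
{code}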
[jira] [Updated] (SPARK-47440) SQLServer does not support LIKE operator in binary comparison
[ https://issues.apache.org/jira/browse/SPARK-47440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47440: --- Labels: pull-request-available (was: ) > SQLServer does not support LIKE operator in binary comparison > - > > Key: SPARK-47440 > URL: https://issues.apache.org/jira/browse/SPARK-47440 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stefan Bukorovic >Priority: Major > Labels: pull-request-available > > When pushing a Spark query down to the MsSqlServer engine, we sometimes construct a SQL > query that has a LIKE operator as part of a binary comparison operation, > which is not permitted in SQL Server syntax. > For example, the query > {code:java} > SELECT * FROM people WHERE (name LIKE "s%") = 1{code} > will not execute on MsSqlServer. > Such queries should be detected and not pushed down to MsSqlServer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47440) SQLServer does not support LIKE operator in binary comparison
Stefan Bukorovic created SPARK-47440: Summary: SQLServer does not support LIKE operator in binary comparison Key: SPARK-47440 URL: https://issues.apache.org/jira/browse/SPARK-47440 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Stefan Bukorovic When pushing a Spark query down to the MsSqlServer engine, we sometimes construct a SQL query that has a LIKE operator as part of a binary comparison operation, which is not permitted in SQL Server syntax. For example, the query {code:java} SELECT * FROM people WHERE (name LIKE "s%") = 1{code} will not execute on MsSqlServer. Such queries should be detected and not pushed down to MsSqlServer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
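A self-contained sketch of the kind of detection this implies, over a hypothetical mini-AST (Spark's real pushdown works on its connector expression API, so the types below are illustrative only):

{code:java}
// Hypothetical expression types for illustration.
sealed trait Expr
final case class Column(name: String) extends Expr
final case class Like(column: Expr, pattern: String) extends Expr
final case class Literal(value: Any) extends Expr
final case class BinaryComparison(op: String, left: Expr, right: Expr) extends Expr

// True if a LIKE sits under a binary comparison, which SQL Server rejects;
// such a predicate should stay in Spark rather than be pushed down.
def hasLikeUnderComparison(e: Expr): Boolean = e match {
  case BinaryComparison(_, l, r) =>
    l.isInstanceOf[Like] || r.isInstanceOf[Like] ||
      hasLikeUnderComparison(l) || hasLikeUnderComparison(r)
  case Like(c, _) => hasLikeUnderComparison(c)
  case _ => false
}

// hasLikeUnderComparison(
//   BinaryComparison("=", Like(Column("name"), "s%"), Literal(1))) // true
{code}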
[jira] [Updated] (SPARK-47422) Support collated strings in array operations
[ https://issues.apache.org/jira/browse/SPARK-47422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47422: --- Labels: pull-request-available (was: ) > Support collated strings in array operations > > > Key: SPARK-47422 > URL: https://issues.apache.org/jira/browse/SPARK-47422 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Labels: pull-request-available > > Collations need to be properly supported in the following array operations, which > currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, > ArrayIntersect, ArrayExcept. Example query: > {code:java} > select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate > utf8_binary_lcase){code} > We would expect the result of this query to be true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
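The expected semantics under the case-insensitive UTF8_BINARY_LCASE collation are easy to state in plain Scala; a minimal sketch (illustration only, not Spark's collation machinery):

{code:java}
// Sketch of the expected semantics: an array membership test that respects
// a lowercase (case-insensitive) collation.
def lcaseArrayContains(arr: Seq[String], elem: String): Boolean =
  arr.exists(_.equalsIgnoreCase(elem))

// lcaseArrayContains(Seq("aaa"), "AAA") // true, matching the expectation above
{code}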
[jira] [Updated] (SPARK-47438) Upgrade jackson to 2.17.0
[ https://issues.apache.org/jira/browse/SPARK-47438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47438: --- Labels: pull-request-available (was: ) > Upgrade jackson to 2.17.0 > - > > Key: SPARK-47438 > URL: https://issues.apache.org/jira/browse/SPARK-47438 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47437) Correct the error class for `DataFrame.sort`
[ https://issues.apache.org/jira/browse/SPARK-47437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47437. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45559 [https://github.com/apache/spark/pull/45559] > Correct the error class for `DataFrame.sort` > > > Key: SPARK-47437 > URL: https://issues.apache.org/jira/browse/SPARK-47437 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47437) Correct the error class for `DataFrame.sort`
[ https://issues.apache.org/jira/browse/SPARK-47437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47437: Assignee: Ruifeng Zheng > Correct the error class for `DataFrame.sort` > > > Key: SPARK-47437 > URL: https://issues.apache.org/jira/browse/SPARK-47437 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43435) re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream`
[ https://issues.apache.org/jira/browse/SPARK-43435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43435. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45560 [https://github.com/apache/spark/pull/45560] > re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream` > --- > > Key: SPARK-43435 > URL: https://issues.apache.org/jira/browse/SPARK-43435 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47439. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45561 [https://github.com/apache/spark/pull/45561] > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47439: Assignee: Hyukjin Kwon > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: Apache Spark > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: (was: Apache Spark) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: Apache Spark > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: (was: Apache Spark) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47439: --- Labels: pull-request-available (was: ) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43435) re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream`
[ https://issues.apache.org/jira/browse/SPARK-43435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43435: --- Labels: pull-request-available (was: ) > re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream` > --- > > Key: SPARK-43435 > URL: https://issues.apache.org/jira/browse/SPARK-43435 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-47439: - Summary: Document Python Data Source API in API reference page (was: Document Python Data Source API) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47439) Document Python Data Source API
Hyukjin Kwon created SPARK-47439: Summary: Document Python Data Source API Key: SPARK-47439 URL: https://issues.apache.org/jira/browse/SPARK-47439 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org