[jira] [Updated] (SPARK-47458) Incorrect to calculate the concurrent task number
[ https://issues.apache.org/jira/browse/SPARK-47458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bobby Wang updated SPARK-47458:
-------------------------------
    Summary: Incorrect to calculate the concurrent task number  (was: Wrong to calculate the concurrent task number)

> Incorrect to calculate the concurrent task number
> -------------------------------------------------
>
> Key: SPARK-47458
> URL: https://issues.apache.org/jira/browse/SPARK-47458
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Bobby Wang
> Priority: Major
>
> The test case below fails:
>
> {code:java}
> test("problem of calculating the maximum concurrent task") {
>   withTempDir { dir =>
>     val discoveryScript = createTempScriptWithExpectedOutput(
>       dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", "2", "3"]}""")
>
>     val conf = new SparkConf()
>       // Set up a local cluster with a single executor that has 6 CPUs and 4 GPUs.
>       .setMaster("local-cluster[1, 6, 1024]")
>       .setAppName("test-cluster")
>       .set(WORKER_GPU_ID.amountConf, "4")
>       .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
>       .set(EXECUTOR_GPU_ID.amountConf, "4")
>       .set(TASK_GPU_ID.amountConf, "2")
>       // Disable barrier stage retry to fail the application as soon as possible.
>       .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
>     sc = new SparkContext(conf)
>     TestUtils.waitUntilExecutorsUp(sc, 1, 6)
>
>     // Set up a barrier stage with 2 tasks, each requiring 1 CPU and 2 GPUs.
>     // The cluster has 6 CPUs and 4 GPUs in total, so the stage's requirement
>     // (2 CPUs and 4 GPUs) can be satisfied, yet the concurrent-task check fails.
>     assert(sc.parallelize(Range(1, 10), 2)
>       .barrier()
>       .mapPartitions { iter => iter }
>       .collect() sameElements Range(1, 10).toArray[Int])
>   }
> }
> {code}
>
> The error log:
>
> [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
> org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
>   at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
>   at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
>   at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)
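With the configuration above, the expected slot count is simple enough to check by hand. A back-of-the-envelope sketch of the arithmetic the scheduler should perform (not Spark's internal code):

{code:java}
// 1 executor with 6 cores and 4 GPUs; each task needs 1 CPU and 2 GPUs.
val slotsByCpu = 6 / 1                                    // 6
val slotsByGpu = 4 / 2                                    // 2
val maxConcurrentTasks = math.min(slotsByCpu, slotsByGpu) // 2
// 2 slots >= 2 barrier tasks, so the check should pass; the reported bug is
// that Spark computes a smaller number and rejects the stage.
{code}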
[jira] [Created] (SPARK-47459) Cancel running stage if the result is empty relation
Yuming Wang created SPARK-47459:
-----------------------------------

             Summary: Cancel running stage if the result is empty relation
                 Key: SPARK-47459
                 URL: https://issues.apache.org/jira/browse/SPARK-47459
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.1
            Reporter: Yuming Wang
         Attachments: task stack trace.png

How to reproduce:

bin/spark-sql --master yarn --conf spark.driver.host=10.211.174.53

{code:sql}
set spark.sql.adaptive.enabled=true;

select a from (select id as a, id as b, id as z from range(1)) t1
join (select id as c, id as d from range(2)) t2 on t1.a = t2.c
join (select id as e, id as f from range(3)) t3 on t2.d = t3.e
where z % 10 < 0
group by 1;
{code}
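The filter above can never match ({{z}} comes from {{range(1)}}, so {{z % 10}} is never negative), which makes the final result a provably empty relation while the join stages may still be running. A minimal sketch of that observation, assuming an active SparkSession {{spark}} (not code from the ticket):

{code:java}
// The predicate is false for every row, so the aggregate input is empty.
spark.sql("set spark.sql.adaptive.enabled=true")
spark.sql("select id % 10 < 0 as p from range(1)").show()  // p = false
// The proposed improvement: once AQE proves the result is an empty relation,
// Spark could cancel the still-running upstream stages instead of waiting.
{code}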
[jira] [Updated] (SPARK-47459) Cancel running stage if the result is empty relation
[ https://issues.apache.org/jira/browse/SPARK-47459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-47459:
--------------------------------
    Attachment: task stack trace.png

> Cancel running stage if the result is empty relation
> ----------------------------------------------------
>
> Key: SPARK-47459
> URL: https://issues.apache.org/jira/browse/SPARK-47459
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Yuming Wang
> Priority: Major
> Attachments: task stack trace.png
>
> How to reproduce:
> bin/spark-sql --master yarn --conf spark.driver.host=10.211.174.53
> {code:sql}
> set spark.sql.adaptive.enabled=true;
> select a from (select id as a, id as b, id as z from range(1)) t1
> join (select id as c, id as d from range(2)) t2 on t1.a = t2.c
> join (select id as e, id as f from range(3)) t3 on t2.d = t3.e
> where z % 10 < 0
> group by 1;
> {code}
[jira] [Created] (SPARK-47458) Wrong to calculate the concurrent task number
Bobby Wang created SPARK-47458:
----------------------------------

             Summary: Wrong to calculate the concurrent task number
                 Key: SPARK-47458
                 URL: https://issues.apache.org/jira/browse/SPARK-47458
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Bobby Wang

The test case below fails:

{code:java}
test("problem of calculating the maximum concurrent task") {
  withTempDir { dir =>
    val discoveryScript = createTempScriptWithExpectedOutput(
      dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", "2", "3"]}""")

    val conf = new SparkConf()
      // Set up a local cluster with a single executor that has 6 CPUs and 4 GPUs.
      .setMaster("local-cluster[1, 6, 1024]")
      .setAppName("test-cluster")
      .set(WORKER_GPU_ID.amountConf, "4")
      .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
      .set(EXECUTOR_GPU_ID.amountConf, "4")
      .set(TASK_GPU_ID.amountConf, "2")
      // Disable barrier stage retry to fail the application as soon as possible.
      .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
    sc = new SparkContext(conf)
    TestUtils.waitUntilExecutorsUp(sc, 1, 6)

    // Set up a barrier stage with 2 tasks, each requiring 1 CPU and 2 GPUs.
    // The cluster has 6 CPUs and 4 GPUs in total, so the stage's requirement
    // (2 CPUs and 4 GPUs) can be satisfied, yet the concurrent-task check fails.
    assert(sc.parallelize(Range(1, 10), 2)
      .barrier()
      .mapPartitions { iter => iter }
      .collect() sameElements Range(1, 10).toArray[Int])
  }
}
{code}

The error log:

[SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
  at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
  at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
  at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
  at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)
[jira] [Updated] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4+
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47457:
----------------------------------
    Summary: Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4+  (was: Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4)

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4+
> ----------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
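The retitling to "3.4+" suggests the existing check recognizes only 3.3.x-style version strings and misses newer Hadoop releases. A hedged sketch of a forward-compatible check (a hypothetical helper, not the actual Spark patch):

{code:java}
// Shaded Hadoop clients exist from 3.3 onward, so compare the parsed
// (major, minor) pair instead of matching a fixed "3.3" prefix.
def supportsHadoopShadedClient(hadoopVersion: String): Boolean = {
  val nums = hadoopVersion.split("\\.").take(2)
    .map(_.takeWhile(_.isDigit)).filter(_.nonEmpty).map(_.toInt)
  nums.length == 2 && (nums(0) > 3 || (nums(0) == 3 && nums(1) >= 3))
}

// supportsHadoopShadedClient("3.3.6") == true
// supportsHadoopShadedClient("3.4.0") == true   // the case the ticket fixes
// supportsHadoopShadedClient("3.2.4") == false
{code}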
[jira] [Assigned] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47457:
-------------------------------------
    Assignee: Dongjoon Hyun

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
> ---------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47457:
----------------------------------
    Component/s: SQL
                 (was: Spark Core)

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
> ---------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
[ https://issues.apache.org/jira/browse/SPARK-47457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47457:
-----------------------------------
    Labels: pull-request-available  (was: )

> Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
> ---------------------------------------------------------------------------
>
> Key: SPARK-47457
> URL: https://issues.apache.org/jira/browse/SPARK-47457
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47457) Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
Dongjoon Hyun created SPARK-47457:
-------------------------------------

             Summary: Fix `IsolatedClientLoader.supportsHadoopShadedClient` to handle Hadoop 3.4
                 Key: SPARK-47457
                 URL: https://issues.apache.org/jira/browse/SPARK-47457
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-47452) Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
[ https://issues.apache.org/jira/browse/SPARK-47452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47452.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45576
[https://github.com/apache/spark/pull/45576]

> Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
> --------------------------------------------
>
> Key: SPARK-47452
> URL: https://issues.apache.org/jira/browse/SPARK-47452
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47456:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Improvement)

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47456:
-------------------------------------
    Assignee: dzcxzl

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47456.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45584
[https://github.com/apache/spark/pull/45584]

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-47435) SPARK-45561 causes mysql unsigned tinyint overflow
[ https://issues.apache.org/jira/browse/SPARK-47435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-47435:
-----------------------------
    Fix Version/s: 3.5.2

> SPARK-45561 causes mysql unsigned tinyint overflow
> --------------------------------------------------
>
> Key: SPARK-47435
> URL: https://issues.apache.org/jira/browse/SPARK-47435
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
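MySQL's TINYINT UNSIGNED covers 0..255, which does not fit Spark's signed ByteType (-128..127). A hedged sketch of the kind of widening a JDBC dialect mapping needs ({{tinyIntCatalystType}} is a hypothetical helper, not the actual patch):

{code:java}
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

// Widen unsigned TINYINT to ShortType so values 128..255 do not overflow
// a signed byte; signed TINYINT still fits ByteType.
def tinyIntCatalystType(isSigned: Boolean): DataType =
  if (isSigned) ByteType else ShortType
{code}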
[jira] [Resolved] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47453.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45581
[https://github.com/apache/spark/pull/45581]

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Assigned] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47453:
-------------------------------------
    Assignee: Kent Yao

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47422) Support collated strings in array operations
[ https://issues.apache.org/jira/browse/SPARK-47422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-47422.
---------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45563
[https://github.com/apache/spark/pull/45563]

> Support collated strings in array operations
> --------------------------------------------
>
> Key: SPARK-47422
> URL: https://issues.apache.org/jira/browse/SPARK-47422
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Assignee: Nikola Mandic
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Collations need to be properly supported in the following array operations, but they currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, ArrayIntersect, ArrayExcept. Example query:
> {code:java}
> select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate utf8_binary_lcase){code}
> We would expect the result of this query to be true.
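An illustration of the expected semantics from the ticket, runnable from a Scala session (assumes a SparkSession {{spark}} built with collation support):

{code:java}
// Under the case-insensitive utf8_binary_lcase collation, membership tests
// should treat 'aaa' and 'AAA' as equal, so this query should return true.
spark.sql(
  "select array_contains(array('aaa' collate utf8_binary_lcase), " +
  "'AAA' collate utf8_binary_lcase)"
).show()
{code}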
[jira] [Assigned] (SPARK-47422) Support collated strings in array operations
[ https://issues.apache.org/jira/browse/SPARK-47422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-47422:
-----------------------------------
    Assignee: Nikola Mandic

> Support collated strings in array operations
> --------------------------------------------
>
> Key: SPARK-47422
> URL: https://issues.apache.org/jira/browse/SPARK-47422
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Assignee: Nikola Mandic
> Priority: Major
> Labels: pull-request-available
>
> Collations need to be properly supported in the following array operations, but they currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, ArrayIntersect, ArrayExcept. Example query:
> {code:java}
> select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate utf8_binary_lcase){code}
> We would expect the result of this query to be true.
[jira] [Updated] (SPARK-47456) Support ORC Brotli codec
[ https://issues.apache.org/jira/browse/SPARK-47456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47456:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support ORC Brotli codec
> ------------------------
>
> Key: SPARK-47456
> URL: https://issues.apache.org/jira/browse/SPARK-47456
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-47456) Support ORC Brotli codec
dzcxzl created SPARK-47456:
------------------------------

             Summary: Support ORC Brotli codec
                 Key: SPARK-47456
                 URL: https://issues.apache.org/jira/browse/SPARK-47456
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: dzcxzl
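A hedged usage sketch, assuming the new codec is surfaced through the usual ORC {{compression}} write option once the bundled ORC library supports Brotli (the path is illustrative):

{code:java}
// Write ORC with the Brotli codec; reading back needs no extra option.
spark.range(10).write.option("compression", "brotli").orc("/tmp/orc_brotli_demo")
spark.read.orc("/tmp/orc_brotli_demo").show()
{code}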
[jira] [Updated] (SPARK-45393) Upgrade Hadoop to 3.4.0
[ https://issues.apache.org/jira/browse/SPARK-45393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-45393:
-----------------------------------
    Labels: pull-request-available  (was: )

> Upgrade Hadoop to 3.4.0
> -----------------------
>
> Key: SPARK-45393
> URL: https://issues.apache.org/jira/browse/SPARK-45393
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (SPARK-47455) Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
[ https://issues.apache.org/jira/browse/SPARK-47455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47455:
-----------------------------------
    Labels: pull-request-available  (was: )

> Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
> -------------------------------------------------------------------------
>
> Key: SPARK-47455
> URL: https://issues.apache.org/jira/browse/SPARK-47455
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.4.2, 4.0.0, 3.5.1
> Reporter: Yang Jie
> Priority: Minor
> Labels: pull-request-available
>
> [https://github.com/apache/spark/blob/e01ed0da22f24204fe23143032ff39be7f4b56af/project/SparkBuild.scala#L157-L173]
>
> {code:java}
> val scalaStyleOnCompileConfig: String = {
>   val in = "scalastyle-config.xml"
>   val out = "scalastyle-on-compile.generated.xml"
>   val replacements = Map(
>     """customId="println" level="error""" -> """customId="println" level="warn"""
>   )
>   var contents = Source.fromFile(in).getLines.mkString("\n")
>   for ((k, v) <- replacements) {
>     require(contents.contains(k), s"Could not rewrite '$k' in original scalastyle config.")
>     contents = contents.replace(k, v)
>   }
>   new PrintWriter(out) {
>     write(contents)
>     close()
>   }
>   out
> }
> {code}
> `Source.fromFile(in)` opens a `BufferedSource` resource handle, but it does not close it.
[jira] [Created] (SPARK-47455) Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
Yang Jie created SPARK-47455:
--------------------------------

             Summary: Fix Resource Handling of `scalaStyleOnCompileConfig` in SparkBuild.scala
                 Key: SPARK-47455
                 URL: https://issues.apache.org/jira/browse/SPARK-47455
             Project: Spark
          Issue Type: Bug
          Components: Project Infra
    Affects Versions: 3.5.1, 3.4.2, 4.0.0
            Reporter: Yang Jie

[https://github.com/apache/spark/blob/e01ed0da22f24204fe23143032ff39be7f4b56af/project/SparkBuild.scala#L157-L173]

{code:java}
val scalaStyleOnCompileConfig: String = {
  val in = "scalastyle-config.xml"
  val out = "scalastyle-on-compile.generated.xml"
  val replacements = Map(
    """customId="println" level="error""" -> """customId="println" level="warn"""
  )
  var contents = Source.fromFile(in).getLines.mkString("\n")
  for ((k, v) <- replacements) {
    require(contents.contains(k), s"Could not rewrite '$k' in original scalastyle config.")
    contents = contents.replace(k, v)
  }
  new PrintWriter(out) {
    write(contents)
    close()
  }
  out
}
{code}

`Source.fromFile(in)` opens a `BufferedSource` resource handle, but it does not close it.
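A hedged sketch of one way to close the handle, using {{scala.util.Using}} from Scala 2.13 (illustrative, not necessarily the merged fix; {{in}} is the file name from the quoted code):

{code:java}
import scala.io.Source
import scala.util.Using

val in = "scalastyle-config.xml"

// Read the file through Using.resource so the BufferedSource is always
// closed, even if getLines() or mkString throws.
val contents: String = Using.resource(Source.fromFile(in)) { src =>
  src.getLines().mkString("\n")
}
{code}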
[jira] [Updated] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47453:
-----------------------------------
    Labels: pull-request-available  (was: )

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
[ https://issues.apache.org/jira/browse/SPARK-47453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-47453:
-----------------------------
    Priority: Minor  (was: Major)

> Upgrade MySQL docker image version to 8.3.0
> -------------------------------------------
>
> Key: SPARK-47453
> URL: https://issues.apache.org/jira/browse/SPARK-47453
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Docker, SQL, Tests
> Affects Versions: 4.0.0
> Reporter: Kent Yao
> Priority: Minor
[jira] [Created] (SPARK-47453) Upgrade MySQL docker image version to 8.3.0
Kent Yao created SPARK-47453:
--------------------------------

             Summary: Upgrade MySQL docker image version to 8.3.0
                 Key: SPARK-47453
                 URL: https://issues.apache.org/jira/browse/SPARK-47453
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Docker, SQL, Tests
    Affects Versions: 4.0.0
            Reporter: Kent Yao
[jira] [Resolved] (SPARK-47329) Persist df while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch
[ https://issues.apache.org/jira/browse/SPARK-47329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-47329.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45432
[https://github.com/apache/spark/pull/45432]

> Persist df while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-47329
> URL: https://issues.apache.org/jira/browse/SPARK-47329
> Project: Spark
> Issue Type: Task
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Anish Shrigondekar
> Assignee: Anish Shrigondekar
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Persist df while using foreachbatch and stateful streaming query to prevent state from being re-loaded in each batch
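The ticket automates a pattern users previously had to apply by hand. A hedged sketch of that manual pattern ({{streamingDf}} and the sink paths are illustrative):

{code:java}
import org.apache.spark.sql.DataFrame

streamingDf.writeStream.foreachBatch { (batchDf: DataFrame, batchId: Long) =>
  // Persist so the two actions below reuse one materialization of the batch
  // instead of re-running the stateful operator (and re-loading its state).
  batchDf.persist()
  batchDf.write.mode("append").parquet("/tmp/sink_a")
  batchDf.write.mode("append").parquet("/tmp/sink_b")
  batchDf.unpersist()
  ()
}.start()
{code}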
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828143#comment-17828143 ]

Ivan Sadikov commented on SPARK-46990:
--------------------------------------

Opened PR https://github.com/apache/spark/pull/45578.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Labels: pull-request-available
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain 2734305632140666820. Created 2734305632140666820 with default configuration (schedulingMode: FIFO,
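A hedged Scala restatement of the expected behavior (the path is illustrative): a zero-record Avro container file should load as an empty DataFrame rather than hang.

{code:java}
// Expected: count() returns 0 promptly; the reported regression hangs here.
val df = spark.read.format("avro").load("/tmp/empty_eventhub_capture.avro")
assert(df.count() == 0L)
{code}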
[jira] [Updated] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46990:
-----------------------------------
    Labels: pull-request-available  (was: )

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Labels: pull-request-available
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain 2734305632140666820. Created 2734305632140666820 with default configuration (schedulingMode: FIFO, minShare: 0, weight: 1)
> 24/02/06 10:03:12 INFO FairSchedulabl
[jira] [Updated] (SPARK-47452) Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
[ https://issues.apache.org/jira/browse/SPARK-47452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-47452:
----------------------------------
    Summary: Use `Ubuntu 22.04` in `dev/infra/Dockerfile`  (was: Use Ubuntu 22.04 in `dev/infra/Dockerfile`)

> Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
> --------------------------------------------
>
> Key: SPARK-47452
> URL: https://issues.apache.org/jira/browse/SPARK-47452
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
[jira] [Updated] (SPARK-47452) Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
[ https://issues.apache.org/jira/browse/SPARK-47452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47452:
-----------------------------------
    Labels: pull-request-available  (was: )

> Use `Ubuntu 22.04` in `dev/infra/Dockerfile`
> --------------------------------------------
>
> Key: SPARK-47452
> URL: https://issues.apache.org/jira/browse/SPARK-47452
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47452) Use Ubuntu 22.04 in `dev/infra/Dockerfile`
Dongjoon Hyun created SPARK-47452:
-------------------------------------

             Summary: Use Ubuntu 22.04 in `dev/infra/Dockerfile`
                 Key: SPARK-47452
                 URL: https://issues.apache.org/jira/browse/SPARK-47452
             Project: Spark
          Issue Type: Sub-task
          Components: Project Infra
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47450.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45574
[https://github.com/apache/spark/pull/45574]

> Use R 4.3.3 in `windows` R GitHub Action job
> --------------------------------------------
>
> Key: SPARK-47450
> URL: https://issues.apache.org/jira/browse/SPARK-47450
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
[ https://issues.apache.org/jira/browse/SPARK-47448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47448.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45572
[https://github.com/apache/spark/pull/45572]

> Enable spark.shuffle.service.removeShuffle by default
> -----------------------------------------------------
>
> Key: SPARK-47448
> URL: https://issues.apache.org/jira/browse/SPARK-47448
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
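With the default flipped to true, deployments that relied on shuffle files outliving released executors can opt back out. A hedged sketch (the configuration key is from the ticket; the rest is illustrative):

{code:java}
import org.apache.spark.SparkConf

// Restore the previous behavior of keeping shuffle data after executor release.
val conf = new SparkConf().set("spark.shuffle.service.removeShuffle", "false")
{code}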
[jira] [Updated] (SPARK-47451) Support to_json(variant)
[ https://issues.apache.org/jira/browse/SPARK-47451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47451:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support to_json(variant)
> ------------------------
>
> Key: SPARK-47451
> URL: https://issues.apache.org/jira/browse/SPARK-47451
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Chenhao Li
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (SPARK-47451) Support to_json(variant)
Chenhao Li created SPARK-47451:
----------------------------------

             Summary: Support to_json(variant)
                 Key: SPARK-47451
                 URL: https://issues.apache.org/jira/browse/SPARK-47451
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Chenhao Li
[jira] [Updated] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47450:
-----------------------------------
    Labels: pull-request-available  (was: )

> Use R 4.3.3 in `windows` R GitHub Action job
> --------------------------------------------
>
> Key: SPARK-47450
> URL: https://issues.apache.org/jira/browse/SPARK-47450
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra, R
> Affects Versions: 4.0.0
> Reporter: Dongjoon Hyun
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (SPARK-45921) Use Hadoop 3.3.5 winutils in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-45921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45921:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Improvement)

> Use Hadoop 3.3.5 winutils in AppVeyor build
> -------------------------------------------
>
> Key: SPARK-45921
> URL: https://issues.apache.org/jira/browse/SPARK-45921
> Project: Spark
> Issue Type: Sub-task
> Components: Project Infra
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Updated] (SPARK-45995) Upgrade R version from 4.3.1 to 4.3.2 in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-45995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45995:
----------------------------------
        Parent: SPARK-44111
    Issue Type: Sub-task  (was: Improvement)

> Upgrade R version from 4.3.1 to 4.3.2 in AppVeyor
> -------------------------------------------------
>
> Key: SPARK-45995
> URL: https://issues.apache.org/jira/browse/SPARK-45995
> Project: Spark
> Issue Type: Sub-task
> Components: R
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> https://cran.r-project.org/doc/manuals/r-release/NEWS.html
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828113#comment-17828113 ]

Pavlo Pohrrebnyi commented on SPARK-46990:
------------------------------------------

[~ivan.sadikov], feel free to use it. That is a standard Event Hub capture file, with the Azure-defined schema and no data inside.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain 2734305632140666820. Created 2734305632140666820 with default
[jira] [Created] (SPARK-47449) Refactor and split list/timer unit tests
Jing Zhan created SPARK-47449:
---------------------------------

             Summary: Refactor and split list/timer unit tests
                 Key: SPARK-47449
                 URL: https://issues.apache.org/jira/browse/SPARK-47449
             Project: Spark
          Issue Type: Task
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Jing Zhan

Refactor ListState and timer related unit tests. As planned in the test plan for state-v2, list/timer should be tested in both integration and unit tests. Currently, timer-related tests could be refactored to use the base suite class in {{ValueStateSuite}}, and list state unit tests are needed in addition to {{TransformWithListStateSuite}}.
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828107#comment-17828107 ]

Ivan Sadikov commented on SPARK-46990:
--------------------------------------

Thanks, Kamil. I am still debugging and will try to open a PR with the fix today or tomorrow.

[~pashashiz] Is it okay to use the provided sample file in the PR for a unit test? I will also reach out to Databricks to fix it there.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, which has not been configured. This can happen when the file that pools are read from isn't set, or whe
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828106#comment-17828106 ]

Kamil Kandzia commented on SPARK-46990:
---------------------------------------

It is likely that the cause appeared before your fixes were created. I observed this on Databricks 14.0, which was released in September 2023. According to [Databricks Runtime 14.0 | Databricks on AWS|https://docs.databricks.com/en/release-notes/runtime/14.0.html], it doesn't contain your changes from SPARK-46633.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
> Key: SPARK-46990
> URL: https://issues.apache.org/jira/browse/SPARK-46990
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Databricks 14.0 - 14.3 (Spark 3.5.0)
> Reporter: Kamil Kandzia
> Priority: Major
> Attachments: second=02.avro
>
> In Azure, I use Databricks and Event Hubs. Up until Spark 3.4.1 (Databricks 13.3 LTS), empty Avro files emitted by Event Hubs could be read. Since 3.5.0 it is impossible to load these files (even when loading multiple Avro files of which only one is empty, operations like count or save cannot complete). I tested this on Databricks 14.0, 14.1, 14.2 and 14.3, and it does not work properly in any of them.
> I use the following code:
> {code:java}
> df = spark.read.format("avro") \
>     .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro')
>
> df.count()  <- in this operation Spark hangs
> {code}
> A fragment of the Databricks logs and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters:
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters:
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on :43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, max split size: 4194304 bytes, max partition size: 4194304, open cost is considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 31 (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 2734305632140666820, w
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828100#comment-17828100 ] Ivan Sadikov commented on SPARK-46990: -- Yes, sure. Thanks for reporting. I will take a look and open a PR once I have root caused the issue. > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. 
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input > to shuffle 11 > 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 > output partitions > 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) > 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List() > 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List() > 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 > (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at > LexicalThreadLocal.scala:63), which has no missing parents > 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 31 (MapPartitionsRDD[104] at > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 > tasks are for partitions Vector(0)) > 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks > resource profile 0 > 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1 > 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with > scheduler pool 2734305632140666820, which has not been configured. This can > happen when the file that pools are read from isn't set, or when that file > doesn't contain 2734305632140666820. Created 2734305632140666820 with default > configuration (schedulingMode: FI
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828095#comment-17828095 ] Kamil Kandzia commented on SPARK-46990: --- Could you [~ivan.sadikov] look into this issue? Pavlo has attached an example avro file (I forgot to do this). > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. 
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input > to shuffle 11 > 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 > output partitions > 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) > 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List() > 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List() > 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 > (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at > LexicalThreadLocal.scala:63), which has no missing parents > 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 31 (MapPartitionsRDD[104] at > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 > tasks are for partitions Vector(0)) > 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks > resource profile 0 > 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1 > 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with > scheduler pool 2734305632140666820, which has not been configured. This can > happen when the file that pools are read from isn't set, or when that file > doesn't contain 2734305632140666820. Created 2734305632140666820 with default > configuration (schedulingMode:
[jira] [Assigned] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
[ https://issues.apache.org/jira/browse/SPARK-47448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47448: - Assignee: Dongjoon Hyun > Enable spark.shuffle.service.removeShuffle by default > - > > Key: SPARK-47448 > URL: https://issues.apache.org/jira/browse/SPARK-47448 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45375) [CORE] Mark connection as timedOut in TransportClient.close
[ https://issues.apache.org/jira/browse/SPARK-45375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-45375: --- Assignee: Hasnain Lakhani > [CORE] Mark connection as timedOut in TransportClient.close > --- > > Key: SPARK-45375 > URL: https://issues.apache.org/jira/browse/SPARK-45375 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > > Avoid a race condition where a connection which is in the process of being > closed could be returned by the TransportClientFactory only to be immediately > closed and cause errors upon use > > This doesn't happen much in practice but is observed more frequently as part > of efforts to add SSL support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
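The race described above follows a common pattern: flag a pooled connection as unusable before tearing it down, so a concurrent factory lookup cannot hand out a half-closed client. A minimal sketch of that pattern (simplified, hypothetical names -- not the actual TransportClient/TransportClientFactory code):
{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Sketch of "mark as timed out before closing" (illustrative only).
class PooledClient {
  private val timedOut = new AtomicBoolean(false)

  // A pool/factory only reuses clients that report themselves active.
  def isActive: Boolean = !timedOut.get

  def close(): Unit = {
    // Flip the flag first: any concurrent lookup now sees an inactive client
    // and creates a fresh one instead of returning this half-closed instance.
    timedOut.set(true)
    // ... release the underlying channel here ...
  }
}
{code}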
[jira] [Comment Edited] (SPARK-45374) [CORE] Add test keys for SSL functionality
[ https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828087#comment-17828087 ] Mridul Muralidharan edited comment on SPARK-45374 at 3/18/24 8:28 PM: -- Missed your query, you can link by: "More" -> Link -> Web Link -> * URL == PR url * Link text == "GitHub Pull Request #" I did it for this PR, please let me know if you are unable to do it for the others was (Author: mridulm80): Missed your query, you can link by: "more' -> link -> Web Link -> * URL == pr url * Link test == "GitHub Pull Request #" I did it for this PR > [CORE] Add test keys for SSL functionality > -- > > Key: SPARK-45374 > URL: https://issues.apache.org/jira/browse/SPARK-45374 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > > Add test SSL keys which will be used for unit and integration tests of the > new SSL RPC functionality -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45374) [CORE] Add test keys for SSL functionality
[ https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-45374: --- Assignee: Hasnain Lakhani > [CORE] Add test keys for SSL functionality > -- > > Key: SPARK-45374 > URL: https://issues.apache.org/jira/browse/SPARK-45374 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Assignee: Hasnain Lakhani >Priority: Major > > Add test SSL keys which will be used for unit and integration tests of the > new SSL RPC functionality -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
[ https://issues.apache.org/jira/browse/SPARK-47448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47448: --- Labels: pull-request-available (was: ) > Enable spark.shuffle.service.removeShuffle by default > - > > Key: SPARK-47448 > URL: https://issues.apache.org/jira/browse/SPARK-47448 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47448) Enable spark.shuffle.service.removeShuffle by default
Dongjoon Hyun created SPARK-47448: - Summary: Enable spark.shuffle.service.removeShuffle by default Key: SPARK-47448 URL: https://issues.apache.org/jira/browse/SPARK-47448 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
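The flag in question already exists as an opt-in; this ticket only flips the default. Until 4.0.0, the behavior can be enabled explicitly, for example (sketch; assumes the external shuffle service is in use):
{code:scala}
import org.apache.spark.SparkConf

// Opt in on releases where the default is still false. When enabled, the
// external shuffle service may delete shuffle blocks that are no longer
// needed, which is useful with dynamic allocation.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.removeShuffle", "true")
{code}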
[jira] [Comment Edited] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828090#comment-17828090 ] Pavlo Pohrrebnyi edited comment on SPARK-46990 at 3/18/24 8:22 PM: --- We are experiencing the same with Spark 3.5. That is likely caused by SPARK-46633. Here is the change: [PR-44635|https://github.com/apache/spark/pull/44635/files#diff-c139f61eabcfcb9725c8caeb747becae061a2ea44f774b12c9cce5aeac102880]. The job hangs once it tries to read avro files with no records. It loops forever here: {code:java} org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow(AvroUtils.scala:265) org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow$(AvroUtils.scala:263) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNextRow(AvroFileFormat.scala:186) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNext(AvroFileFormat.scala:201) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:604) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:798) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:506) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$Lambda$1269/1728924771.apply$mcZ$sp(Unknown Source) {code} Here is the sample to reproduce the issue: [^second=02.avro] was (Author: pashashiz): We are experiencing the same with Spark 3.5. That is likely caused by SPARK-46633. Here is the change [PR-44635|https://github.com/apache/spark/pull/44635/files#diff-c139f61eabcfcb9725c8caeb747becae061a2ea44f774b12c9cce5aeac102880]] The job hangs once tries to read avro files with no records. It loops forever here: {code:java} org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow(AvroUtils.scala:265) org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow$(AvroUtils.scala:263) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNextRow(AvroFileFormat.scala:186) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNext(AvroFileFormat.scala:201) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:604) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:798) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:506) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$Lambda$1269/1728924771.apply$mcZ$sp(Unknown Source) {code} Here is the sample to reproduce the issue: [^second=02.avro] > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). 
I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/
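The stack trace in the comment above shows the read stuck inside AvroUtils$RowReader.hasNextRow. A minimal sketch of how such a loop can spin forever on a file with zero records (simplified names, illustrative only -- not the actual Spark code):
{code:scala}
// A hasNext that waits for a row but has no exit path for an exhausted
// source never terminates on an empty file.
class RowReader(records: Iterator[String]) {
  private var nextRow: Option[String] = None

  def hasNextRow: Boolean = {
    while (nextRow.isEmpty) {
      if (records.hasNext) {
        nextRow = Some(records.next())
      }
      // BUG: no "else return false" branch -- when `records` is empty the
      // condition never changes and the while loop never exits.
    }
    true
  }
}

// new RowReader(Iterator.empty).hasNextRow  // never returns
{code}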
[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828090#comment-17828090 ] Pavlo Pohrrebnyi commented on SPARK-46990: -- We are experiencing the same with Spark 3.5. That is likely caused by [SPARK-46633|https://issues.apache.org/jira/browse/SPARK-46633]. Here is the change: [PR-44635|https://github.com/apache/spark/pull/44635/files#diff-c139f61eabcfcb9725c8caeb747becae061a2ea44f774b12c9cce5aeac102880]. The job hangs once it tries to read avro files with no records. It loops forever here: {code:java} org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow(AvroUtils.scala:265) org.apache.spark.sql.avro.AvroUtils$RowReader.hasNextRow$(AvroUtils.scala:263) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNextRow(AvroFileFormat.scala:186) org.apache.spark.sql.avro.AvroFileFormat$$anon$1.hasNext(AvroFileFormat.scala:201) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:604) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:798) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$hasNext$1(FileScanRDD.scala:506) org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$Lambda$1269/1728924771.apply$mcZ$sp(Unknown Source) {code} Here is the sample to reproduce the issue: [^second=02.avro] > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. 
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. > 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104
[jira] [Updated] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs
[ https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavlo Pohrrebnyi updated SPARK-46990: - Attachment: second=02.avro > Regression: Unable to load empty avro files emitted by event-hubs > - > > Key: SPARK-46990 > URL: https://issues.apache.org/jira/browse/SPARK-46990 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Databricks 14.0 - 14.3 (spark 3.5.0) >Reporter: Kamil Kandzia >Priority: Major > Attachments: second=02.avro > > > In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in > databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. > Since version 3.5.0, it is impossible to load these files (even if I have > multiple avro files to load and one of them is empty, it can't perform an > operation like count or save). I tested this on databricks versions 14.0, > 14.1, 14.2, 14.3 and it doesn't work properly in any of them. > I use the following code: > > {code:java} > df = spark.read.format("avro") \ > .load('abfss://@.dfs.core.windows.net///0/2024/02/05/22/46/10.avro') > > df.count() <- in this operation the spark hangs{code} > I am sending a fragment of logs from databricks and query plan: > {code:java} > 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 1; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for > 1 paths. > 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for > 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9 > 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for > 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21 > 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. > Current active queries:1 > 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: > 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: > 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in > memory (estimated size 409.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes > in memory (estimated size 14.5 KiB, free 3.3 GiB) > 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory > on :43781 (size: 14.5 KiB, free: 3.3 GiB) > 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63 > 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, > max split size: 4194304 bytes, max partition size: 4194304, open cost is > considered as scanning 4194304 bytes. 
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input > to shuffle 11 > 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 > output partitions > 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 > ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) > 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List() > 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List() > 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 > (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at > LexicalThreadLocal.scala:63), which has no missing parents > 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 31 (MapPartitionsRDD[104] at > $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 > tasks are for partitions Vector(0)) > 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks > resource profile 0 > 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1 > 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with > scheduler pool 2734305632140666820, which has not been configured. This can > happen when the file that pools are read from isn't set, or when that file > doesn't contain 2734305632140666820. Created 2734305632140666820 with default > configuration (schedulingMode: FIFO, minShare: 0, weight: 1) > 24/02/06 10:03:12 INFO FairSchedulableBuilder: Added task set TaskSet_31.0 > tasks to pool
[jira] [Commented] (SPARK-45374) [CORE] Add test keys for SSL functionality
[ https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828087#comment-17828087 ] Mridul Muralidharan commented on SPARK-45374: - Missed your query, you can link by: "More" -> Link -> Web Link -> * URL == PR url * Link text == "GitHub Pull Request #" > [CORE] Add test keys for SSL functionality > -- > > Key: SPARK-45374 > URL: https://issues.apache.org/jira/browse/SPARK-45374 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hasnain Lakhani >Priority: Major > > Add test SSL keys which will be used for unit and integration tests of the > new SSL RPC functionality -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
[ https://issues.apache.org/jira/browse/SPARK-47446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47446. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45570 [https://github.com/apache/spark/pull/45570] > Make `BlockManager` warn before `removeBlockInternal` > - > > Key: SPARK-47446 > URL: https://issues.apache.org/jira/browse/SPARK-47446 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code} > 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to > exception java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. > 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed > normally. > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage > 0: Stage cancelled > 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at > SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: > Task serialization failed: java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
[ https://issues.apache.org/jira/browse/SPARK-47446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47446: - Assignee: Dongjoon Hyun > Make `BlockManager` warn before `removeBlockInternal` > - > > Key: SPARK-47446 > URL: https://issues.apache.org/jira/browse/SPARK-47446 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > {code} > 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to > exception java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. > 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed > normally. > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage > 0: Stage cancelled > 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at > SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: > Task serialization failed: java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47447) Allow reading Parquet TimestampLTZ as TimestampNTZ
[ https://issues.apache.org/jira/browse/SPARK-47447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47447: --- Labels: pull-request-available (was: ) > Allow reading Parquet TimestampLTZ as TimestampNTZ > -- > > Key: SPARK-47447 > URL: https://issues.apache.org/jira/browse/SPARK-47447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: pull-request-available > > Currently, Parquet TimestampNTZ type columns can be read as TimestampLTZ, > while reading TimestampLTZ as TimestampNTZ will cause errors. This makes it > impossible to read parquet files containing both TimestampLTZ and > TimestampNTZ as TimestampNTZ. > To make the data type system on Parquet simpler, we should allow reading > TimestampLTZ as TimestampNTZ in the Parquet data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47447) Allow reading Parquet TimestampLTZ as TimestampNTZ
Gengliang Wang created SPARK-47447: -- Summary: Allow reading Parquet TimestampLTZ as TimestampNTZ Key: SPARK-47447 URL: https://issues.apache.org/jira/browse/SPARK-47447 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Currently, Parquet TimestampNTZ type columns can be read as TimestampLTZ, while reading TimestampLTZ as TimestampNTZ will cause errors. This makes it impossible to read parquet files containing both TimestampLTZ and TimestampNTZ as TimestampNTZ. To make the data type system on Parquet simpler, we should allow reading TimestampLTZ as TimestampNTZ in the Parquet data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
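Per the ticket description, after the change a user-supplied TimestampNTZ schema should be honored even when the underlying Parquet column was written as TimestampLTZ. A sketch of the intended usage (behavior as described above, targeting 4.0.0; assumes a SparkSession named spark, and the path is a placeholder):
{code:scala}
import org.apache.spark.sql.types.{StructField, StructType, TimestampNTZType}

// Read back both LTZ- and NTZ-written timestamp columns as TimestampNTZ.
val schema = StructType(Seq(StructField("ts", TimestampNTZType)))
val df = spark.read.schema(schema).parquet("/path/to/mixed_timestamp_files")
{code}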
[jira] [Updated] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
[ https://issues.apache.org/jira/browse/SPARK-47446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47446: --- Labels: pull-request-available (was: ) > Make `BlockManager` warn before `removeBlockInternal` > - > > Key: SPARK-47446 > URL: https://issues.apache.org/jira/browse/SPARK-47446 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > {code} > 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to > exception java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. > 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed > normally. > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 > 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage > 0: Stage cancelled > 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at > SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: > Task serialization failed: java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > java.nio.file.NoSuchFileException: > /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47446) Make `BlockManager` warn before `removeBlockInternal`
Dongjoon Hyun created SPARK-47446: - Summary: Make `BlockManager` warn before `removeBlockInternal` Key: SPARK-47446 URL: https://issues.apache.org/jira/browse/SPARK-47446 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun {code} 24/03/18 18:40:46 WARN BlockManager: Putting block broadcast_0 failed due to exception java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e. 24/03/18 18:40:46 WARN BlockManager: Block broadcast_0 was not removed normally. 24/03/18 18:40:46 INFO TaskSchedulerImpl: Cancelling stage 0 24/03/18 18:40:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled 24/03/18 18:40:46 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) failed in 0.264 s due to Job aborted due to stage failure: Task serialization failed: java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e java.nio.file.NoSuchFileException: /data/spark/blockmgr-56a6c418-90be-4d89-9707-ef45f7eaf74c/0e {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
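The log excerpt shows why the ordering matters: if the internal removal itself throws (here a NoSuchFileException), a warning emitted beforehand still records which block and which failure led to the cleanup. A generic sketch of that warn-before-cleanup ordering (hypothetical helper names, not the actual BlockManager API; assumes slf4j on the classpath):
{code:scala}
import org.slf4j.LoggerFactory

object BlockCleanup {
  private val log = LoggerFactory.getLogger(getClass)

  // Warn first, then attempt the removal that may itself fail: if
  // removeInternal throws, the log already explains the context.
  def abortPut(blockId: String, cause: Throwable)(removeInternal: String => Unit): Unit = {
    log.warn(s"Putting block $blockId failed due to exception $cause.")
    removeInternal(blockId)
  }
}
{code}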
[jira] [Updated] (SPARK-47383) Support `spark.shutdown.timeout` config
[ https://issues.apache.org/jira/browse/SPARK-47383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47383: -- Summary: Support `spark.shutdown.timeout` config (was: Make the shutdown hook timeout configurable) > Support `spark.shutdown.timeout` config > --- > > Key: SPARK-47383 > URL: https://issues.apache.org/jira/browse/SPARK-47383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > org.apache.spark.util.ShutdownHookManager is used to register custom shutdown > operations. This is not easily configurable. The underlying > org.apache.hadoop.util.ShutdownHookManager has a default timeout of 30 > seconds. It can be configured by setting hadoop.service.shutdown.timeout, > but this must be done in the core-site.xml/core-default.xml because a new > hadoop conf object is created and there is no opportunity to modify it. > org.apache.hadoop.util.ShutdownHookManager provides an overload to pass a > custom timeout. Spark should use that and allow a user defined timeout to be > used. > This is useful because we see timeouts during shutdown and want to give some > extra time for the event queues to drain to avoid log data loss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
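With the rename above, the new knob is a plain Spark config, avoiding the core-site.xml workaround described in the issue. A sketch of the intended usage (config name per this ticket's summary, available from 4.0.0):
{code:scala}
import org.apache.spark.SparkConf

// Give shutdown hooks more headroom than Hadoop's 30-second default,
// e.g. so event queues can drain before the JVM exits.
val conf = new SparkConf()
  .set("spark.shutdown.timeout", "60s")
{code}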
[jira] [Resolved] (SPARK-47383) Make the shutdown hook timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-47383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47383. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45504 [https://github.com/apache/spark/pull/45504] > Make the shutdown hook timeout configurable > --- > > Key: SPARK-47383 > URL: https://issues.apache.org/jira/browse/SPARK-47383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > org.apache.spark.util.ShutdownHookManager is used to register custom shutdown > operations. This is not easily configurable. The underlying > org.apache.hadoop.util.ShutdownHookManager has a default timeout of 30 > seconds. It can be configured by setting hadoop.service.shutdown.timeout, > but this must be done in the core-site.xml/core-default.xml because a new > hadoop conf object is created and there is no opportunity to modify it. > org.apache.hadoop.util.ShutdownHookManager provides an overload to pass a > custom timeout. Spark should use that and allow a user defined timeout to be > used. > This is useful because we see timeouts during shutdown and want to give some > extra time for the event queues to drain to avoid log data loss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47383) Make the shutdown hook timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-47383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47383: - Assignee: Rob Reeves > Make the shutdown hook timeout configurable > --- > > Key: SPARK-47383 > URL: https://issues.apache.org/jira/browse/SPARK-47383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Rob Reeves >Assignee: Rob Reeves >Priority: Minor > Labels: pull-request-available > > org.apache.spark.util.ShutdownHookManager is used to register custom shutdown > operations. This is not easily configurable. The underlying > org.apache.hadoop.util.ShutdownHookManager has a default timeout of 30 > seconds. It can be configured by setting hadoop.service.shutdown.timeout, > but this must be done in the core-site.xml/core-default.xml because a new > hadoop conf object is created and there is no opportunity to modify it. > org.apache.hadoop.util.ShutdownHookManager provides an overload to pass a > custom timeout. Spark should use that and allow a user defined timeout to be > used. > This is useful because we see timeouts during shutdown and want to give some > extra time for the event queues to drain to avoid log data loss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47445) Adding new 'Silent' ExplainMode
Victor Sunderland created SPARK-47445: - Summary: Adding new 'Silent' ExplainMode Key: SPARK-47445 URL: https://issues.apache.org/jira/browse/SPARK-47445 Project: Spark Issue Type: Improvement Components: Connect, Documentation, PySpark, SQL Affects Versions: 4.0.0 Reporter: Victor Sunderland While investigating unit test duration we found that org.apache.spark.sql.execution.QueryExecution.explainString() takes approximately 14% of the time. This method generates the string representation of the execution plan. The string is often used for logging purposes. This is also called for each AQE job, so it can save prod execution time too. While SPARK-44485 does exist to help optimize the prod execution time, the main purpose of this PR is to save time during unit testing. I've added a silent mode to ExplainMode to try to mitigate this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
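For context, the explain modes that exist in released Spark are shown below; "silent" is the mode this ticket proposes (name per the ticket, not yet a released API). Assumes a SparkSession named spark, e.g. in spark-shell:
{code:scala}
val df = spark.range(10).selectExpr("id * 2 AS doubled")

df.explain("simple")     // existing: physical plan only
df.explain("extended")   // existing: logical and physical plans
df.explain("formatted")  // existing: operator list plus a details section
// Proposed: a "silent" mode under which explainString() produces no output,
// skipping plan stringification in hot paths such as per-AQE-job logging.
{code}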
[jira] [Updated] (SPARK-47444) Empty numRows table stats should not break Hive tables
[ https://issues.apache.org/jira/browse/SPARK-47444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miklos Szurap updated SPARK-47444: -- Attachment: reproduction_steps_SPARK-47444.txt > Empty numRows table stats should not break Hive tables > -- > > Key: SPARK-47444 > URL: https://issues.apache.org/jira/browse/SPARK-47444 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8 >Reporter: Miklos Szurap >Priority: Major > Labels: Hive, HiveMetaStoreClient, SQL > Attachments: reproduction_steps_SPARK-47444.txt > > > A Hive table cannot be accessed / queried / updated from Spark (it is > completely "broken") if the "numRows" table property (table stat) is > populated with a non-numeric value (like an empty string). Accessing the table > from Spark results in a "NumberFormatException": > {code} > scala> spark.sql("select * from t1p").show() > java.lang.NumberFormatException: Zero length BigInteger > at java.math.BigInteger.(BigInteger.java:420) > ... > at > org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243) > ... > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91) > ... > {code} > or > similarly just with > {code} > java.lang.NumberFormatException: For input string: "Foo" > {code} > Currently the table stats can be broken through Spark with > {code} > scala> spark.sql("alter table t1p set tblproperties('numRows'='', > 'STATS_GENERATED_VIA_STATS_TASK'='true')").show() > {code} > > Spark should: > 1. Validate sparkSQL "alter table" statements and not allow non-numeric > values in the "totalSize", "numRows", "rawDataSize" table properties, as > those are checked in the > [HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28] > 2. The HiveClientImpl#readHiveStats should probably tolerate these wrong > "totalSize", "numRows", "rawDataSize" table properties and not fail with a > cryptic NumberFormatException, but treat those as zero. Or at least it should > provide a clue in the error message about which table property is incorrect. > Note: beeline/Hive validates alter table statements, however Impala can > similarly break the table, so the above item #1 needs to be fixed there too. > I have checked only the Spark 2.4.x behavior, the same probably exists in > Spark 3.x too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47444) Empty numRows table stats should not break Hive tables
Miklos Szurap created SPARK-47444: - Summary: Empty numRows table stats should not break Hive tables Key: SPARK-47444 URL: https://issues.apache.org/jira/browse/SPARK-47444 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.8 Reporter: Miklos Szurap A Hive table cannot be accessed / queried / updated from Spark (it is completely "broken") if the "numRows" table property (table stat) is populated with a non-numeric value (like an empty string). Accessing the table from Spark results in a "NumberFormatException": {code} scala> spark.sql("select * from t1p").show() java.lang.NumberFormatException: Zero length BigInteger at java.math.BigInteger.<init>(BigInteger.java:420) ... at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$readHiveStats(HiveClientImpl.scala:1243) ... at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:91) ... {code} or similarly just with {code} java.lang.NumberFormatException: For input string: "Foo" {code} Currently the table stats can be broken through Spark with {code} scala> spark.sql("alter table t1p set tblproperties('numRows'='', 'STATS_GENERATED_VIA_STATS_TASK'='true')").show() {code} Spark should: 1. Validate Spark SQL "alter table" statements and not allow non-numeric values in the "totalSize", "numRows", "rawDataSize" table properties, as those are checked in [HiveClientImpl#readHiveStats()|https://github.com/apache/spark/blob/1aafe60b3e7633f755499f5394ca62289e42588d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L1260C15-L1260C28] 2. HiveClientImpl#readHiveStats should probably tolerate these wrong "totalSize", "numRows", "rawDataSize" table properties and not fail with a cryptic NumberFormatException, but treat those as zero. Or at least it should provide a clue in the error message about which table property is incorrect. Note: beeline/Hive validates alter table statements; however, Impala can similarly break the table, so the above item #1 needs to be fixed there too. I have checked only the Spark 2.4.x behavior; the same problem probably exists in Spark 3.x too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
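A minimal sketch of the tolerant parsing suggested in item #2 above (an illustration under stated assumptions, not the actual Spark patch; the helper name is hypothetical):

{code:java}
import scala.util.Try

// Sketch only: read a Hive stat table property, treating non-numeric values
// (such as the empty string) as missing instead of throwing.
def readStatSafe(props: Map[String, String], key: String): Option[BigInt] =
  props.get(key).flatMap(v => Try(BigInt(v)).toOption)

// readStatSafe(Map("numRows" -> ""), "numRows")   // None, no exception
// readStatSafe(Map("numRows" -> "42"), "numRows") // Some(42)
{code}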
[jira] [Resolved] (SPARK-47442) Use port 0 to start worker server in MasterSuite
[ https://issues.apache.org/jira/browse/SPARK-47442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47442. --- Fix Version/s: 4.0.0 Assignee: wuyi Resolution: Fixed > Use port 0 to start worker server in MasterSuite > > > Key: SPARK-47442 > URL: https://issues.apache.org/jira/browse/SPARK-47442 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47435) SPARK-45561 causes mysql unsigned tinyint overflow
[ https://issues.apache.org/jira/browse/SPARK-47435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47435. --- Fix Version/s: 4.0.0 Assignee: Kent Yao Resolution: Fixed > SPARK-45561 causes mysql unsigned tinyint overflow > -- > > Key: SPARK-47435 > URL: https://issues.apache.org/jira/browse/SPARK-47435 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47438) Upgrade jackson to 2.17.0
[ https://issues.apache.org/jira/browse/SPARK-47438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47438. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45562 [https://github.com/apache/spark/pull/45562] > Upgrade jackson to 2.17.0 > - > > Key: SPARK-47438 > URL: https://issues.apache.org/jira/browse/SPARK-47438 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47443) Window aggregate support
[ https://issues.apache.org/jira/browse/SPARK-47443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47443: --- Labels: pull-request-available (was: ) > Window aggregate support > > > Key: SPARK-47443 > URL: https://issues.apache.org/jira/browse/SPARK-47443 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47443) Window aggregate support
Aleksandar Tomic created SPARK-47443: Summary: Window aggregate support Key: SPARK-47443 URL: https://issues.apache.org/jira/browse/SPARK-47443 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 4.0.0 Reporter: Aleksandar Tomic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47442) Use port 0 to start worker server in MasterSuite
[ https://issues.apache.org/jira/browse/SPARK-47442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47442: -- Reporter: wuyi (was: Dongjoon Hyun) > Use port 0 to start worker server in MasterSuite > > > Key: SPARK-47442 > URL: https://issues.apache.org/jira/browse/SPARK-47442 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: wuyi >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47434) Streaming Statistics link redirect causing 302 error
[ https://issues.apache.org/jira/browse/SPARK-47434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47434: - Assignee: Huw > Streaming Statistics link redirect causing 302 error > > > Key: SPARK-47434 > URL: https://issues.apache.org/jira/browse/SPARK-47434 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.4.1, 3.5.1 >Reporter: Huw >Assignee: Huw >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.2 > > > When using a reverse proxy, links to the streaming statistics page are missing a > trailing slash, which causes a redirect (302) to an incorrect path. > Essentially the same issue as > https://issues.apache.org/jira/browse/SPARK-24553 but for a different link. > .../StreamingQuery/statistics?id=abcd -> > .../StreamingQuery/statistics/?id=abcd > Linked PR: [https://github.com/apache/spark/pull/45527/files] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47434) Streaming Statistics link redirect causing 302 error
[ https://issues.apache.org/jira/browse/SPARK-47434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47434. --- Fix Version/s: 3.4.3 4.0.0 Resolution: Fixed Issue resolved by pull request 45527 [https://github.com/apache/spark/pull/45527] > Streaming Statistics link redirect causing 302 error > > > Key: SPARK-47434 > URL: https://issues.apache.org/jira/browse/SPARK-47434 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.4.1, 3.5.1 >Reporter: Huw >Assignee: Huw >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.3, 3.5.2, 4.0.0 > > > When using a reverse proxy, links to the streaming statistics page are missing a > trailing slash, which causes a redirect (302) to an incorrect path. > Essentially the same issue as > https://issues.apache.org/jira/browse/SPARK-24553 but for a different link. > .../StreamingQuery/statistics?id=abcd -> > .../StreamingQuery/statistics/?id=abcd > Linked PR: [https://github.com/apache/spark/pull/45527/files] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
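Illustrative sketch only (the real change is in the linked PR): the fix amounts to emitting the trailing slash before the query string, so the reverse proxy resolves the path directly instead of answering with a 302:

{code:java}
// Hypothetical helper for illustration; not the actual Spark UI code.
// ".../statistics?id=x" 302-redirects behind a reverse proxy, while
// ".../statistics/?id=x" resolves directly.
def statisticsLink(uiRoot: String, queryId: String): String =
  s"$uiRoot/StreamingQuery/statistics/?id=$queryId"
{code}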
[jira] [Updated] (SPARK-47442) Use port 0 to start worker server in MasterSuite
[ https://issues.apache.org/jira/browse/SPARK-47442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47442: --- Labels: pull-request-available (was: ) > Use port 0 to start worker server in MasterSuite > > > Key: SPARK-47442 > URL: https://issues.apache.org/jira/browse/SPARK-47442 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47442) Use port 0 to start worker server in MasterSuite
Dongjoon Hyun created SPARK-47442: - Summary: Use port 0 to start worker server in MasterSuite Key: SPARK-47442 URL: https://issues.apache.org/jira/browse/SPARK-47442 Project: Spark Issue Type: Test Components: Spark Core, Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
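As background for the test change above: binding to port 0 asks the OS for any free ephemeral port, which avoids hard-coded-port conflicts in suites like MasterSuite. A minimal, self-contained illustration using plain Java sockets (not the Spark code itself):

{code:java}
import java.net.ServerSocket

// Port 0 lets the OS pick a free ephemeral port; the actual port is then
// queried back, so parallel test runs never collide on a fixed port.
val socket = new ServerSocket(0)
val boundPort = socket.getLocalPort // e.g. 54321, chosen by the OS
socket.close()
{code}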
[jira] [Updated] (SPARK-47438) Upgrade jackson to 2.17.0
[ https://issues.apache.org/jira/browse/SPARK-47438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47438: -- Parent: SPARK-47046 Issue Type: Sub-task (was: Improvement) > Upgrade jackson to 2.17.0 > - > > Key: SPARK-47438 > URL: https://issues.apache.org/jira/browse/SPARK-47438 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47441) Do not add log link for unmanaged AM in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-47441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47441: --- Labels: pull-request-available (was: ) > Do not add log link for unmanaged AM in Spark UI > > > Key: SPARK-47441 > URL: https://issues.apache.org/jira/browse/SPARK-47441 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.5.0, 3.5.1 >Reporter: Yuming Wang >Priority: Major > Labels: pull-request-available > > {noformat} > 24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] > scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception > java.lang.NumberFormatException: For input string: "null" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) > ~[?:?] > at java.lang.Integer.parseInt(Integer.java:668) ~[?:?] > at java.lang.Integer.parseInt(Integer.java:786) ~[?:?] > at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) > ~[scala-library-2.12.18.jar:?] > at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) > ~[scala-library-2.12.18.jar:?] > at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) > ~[scala-library-2.12.18.jar:?] > at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) > ~[scala-library-2.12.18.jar:?] > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) > ~[scala-library-2.12.18.jar:?] 
> at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) > ~[spark-core_2.12-3.5.1.jar:3.5.1] > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) > [spark-core_2.12-3.5.1.jar:3.5.1] > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) > [spark-core_2.12-3.5.1.jar:3.5.1] > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47441) Do not add log link for unmanaged AM in Spark UI
Yuming Wang created SPARK-47441: --- Summary: Do not add log link for unmanaged AM in Spark UI Key: SPARK-47441 URL: https://issues.apache.org/jira/browse/SPARK-47441 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.5.1, 3.5.0 Reporter: Yuming Wang {noformat} 24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception java.lang.NumberFormatException: For input string: "null" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) ~[?:?] at java.lang.Integer.parseInt(Integer.java:668) ~[?:?] at java.lang.Integer.parseInt(Integer.java:786) ~[?:?] at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) ~[scala-library-2.12.18.jar:?] at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.18.jar:?] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) ~[scala-library-2.12.18.jar:?] 
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) [spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) [spark-core_2.12-3.5.1.jar:3.5.1] {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
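The trace above shows Utils.parseHostPort throwing because the unmanaged AM reports the literal string "null" as its port. A defensive-parse sketch for illustration (an assumption only; the actual fix, per the summary, is to skip adding the log link rather than to change the parser):

{code:java}
import scala.util.Try

// Sketch only: tolerate a malformed "host:port" such as "somehost:null"
// instead of letting NumberFormatException escape into the listener bus.
def parseHostPortSafe(hostPort: String): Option[(String, Int)] = {
  val idx = hostPort.lastIndexOf(':')
  if (idx < 0) None
  else Try(hostPort.substring(idx + 1).toInt).toOption
    .map(port => (hostPort.substring(0, idx), port))
}

// parseHostPortSafe("somehost:null") // None, no exception
{code}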
[jira] [Updated] (SPARK-47440) SQLServer does not support LIKE operator in binary comparison
[ https://issues.apache.org/jira/browse/SPARK-47440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47440: --- Labels: pull-request-available (was: ) > SQLServer does not support LIKE operator in binary comparison > - > > Key: SPARK-47440 > URL: https://issues.apache.org/jira/browse/SPARK-47440 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stefan Bukorovic >Priority: Major > Labels: pull-request-available > > When pushing a Spark query down to the MsSqlServer engine, we sometimes construct a SQL > query that has a LIKE operator as part of a binary comparison operation, > which is not permitted in SQL Server syntax. > For example, the query > {code:java} > SELECT * FROM people WHERE (name LIKE "s%") = 1{code} > will not execute on MsSqlServer. > Such queries should be detected and not pushed down to MsSqlServer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47440) SQLServer does not support LIKE operator in binary comparison
Stefan Bukorovic created SPARK-47440: Summary: SQLServer does not support LIKE operator in binary comparison Key: SPARK-47440 URL: https://issues.apache.org/jira/browse/SPARK-47440 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Stefan Bukorovic When pushing a Spark query down to the MsSqlServer engine, we sometimes construct a SQL query that has a LIKE operator as part of a binary comparison operation, which is not permitted in SQL Server syntax. For example, the query {code:java} SELECT * FROM people WHERE (name LIKE "s%") = 1{code} will not execute on MsSqlServer. Such queries should be detected and not pushed down to MsSqlServer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
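A self-contained sketch of the kind of detection this implies, over a hypothetical mini-AST (Spark's real pushdown works on its connector expression API, so the types below are illustrative only):

{code:java}
// Hypothetical expression types for illustration.
sealed trait Expr
final case class Column(name: String) extends Expr
final case class Like(column: Expr, pattern: String) extends Expr
final case class Literal(value: Any) extends Expr
final case class BinaryComparison(op: String, left: Expr, right: Expr) extends Expr

// True if a LIKE sits under a binary comparison, which SQL Server rejects;
// such a predicate should stay in Spark rather than be pushed down.
def hasLikeUnderComparison(e: Expr): Boolean = e match {
  case BinaryComparison(_, l, r) =>
    l.isInstanceOf[Like] || r.isInstanceOf[Like] ||
      hasLikeUnderComparison(l) || hasLikeUnderComparison(r)
  case Like(c, _) => hasLikeUnderComparison(c)
  case _ => false
}

// hasLikeUnderComparison(
//   BinaryComparison("=", Like(Column("name"), "s%"), Literal(1))) // true
{code}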
[jira] [Updated] (SPARK-47422) Support collated strings in array operations
[ https://issues.apache.org/jira/browse/SPARK-47422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47422: --- Labels: pull-request-available (was: ) > Support collated strings in array operations > > > Key: SPARK-47422 > URL: https://issues.apache.org/jira/browse/SPARK-47422 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Labels: pull-request-available > > Collations need to be properly supported in the following array operations, which > currently yield unexpected results: ArraysOverlap, ArrayDistinct, ArrayUnion, > ArrayIntersect, ArrayExcept. Example query: > {code:java} > select array_contains(array('aaa' collate utf8_binary_lcase), 'AAA' collate > utf8_binary_lcase){code} > We would expect the result of this query to be true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
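The expected semantics under the case-insensitive UTF8_BINARY_LCASE collation are easy to state in plain Scala; a minimal sketch (illustration only, not Spark's collation machinery):

{code:java}
// Sketch of the expected semantics: an array membership test that respects
// a lowercase (case-insensitive) collation.
def lcaseArrayContains(arr: Seq[String], elem: String): Boolean =
  arr.exists(_.equalsIgnoreCase(elem))

// lcaseArrayContains(Seq("aaa"), "AAA") // true, matching the expectation above
{code}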
[jira] [Updated] (SPARK-47438) Upgrade jackson to 2.17.0
[ https://issues.apache.org/jira/browse/SPARK-47438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47438: --- Labels: pull-request-available (was: ) > Upgrade jackson to 2.17.0 > - > > Key: SPARK-47438 > URL: https://issues.apache.org/jira/browse/SPARK-47438 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47437) Correct the error class for `DataFrame.sort`
[ https://issues.apache.org/jira/browse/SPARK-47437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47437. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45559 [https://github.com/apache/spark/pull/45559] > Correct the error class for `DataFrame.sort` > > > Key: SPARK-47437 > URL: https://issues.apache.org/jira/browse/SPARK-47437 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47437) Correct the error class for `DataFrame.sort`
[ https://issues.apache.org/jira/browse/SPARK-47437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47437: Assignee: Ruifeng Zheng > Correct the error class for `DataFrame.sort` > > > Key: SPARK-47437 > URL: https://issues.apache.org/jira/browse/SPARK-47437 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43435) re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream`
[ https://issues.apache.org/jira/browse/SPARK-43435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43435. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45560 [https://github.com/apache/spark/pull/45560] > re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream` > --- > > Key: SPARK-43435 > URL: https://issues.apache.org/jira/browse/SPARK-43435 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47439. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45561 [https://github.com/apache/spark/pull/45561] > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47439: Assignee: Hyukjin Kwon > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: Apache Spark > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: (was: Apache Spark) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: Apache Spark > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47439: -- Assignee: (was: Apache Spark) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47439: --- Labels: pull-request-available (was: ) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43435) re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream`
[ https://issues.apache.org/jira/browse/SPARK-43435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43435: --- Labels: pull-request-available (was: ) > re-enable doctest `pyspark.sql.connect.dataframe.DataFrame.writeStream` > --- > > Key: SPARK-43435 > URL: https://issues.apache.org/jira/browse/SPARK-43435 > Project: Spark > Issue Type: Test > Components: Connect, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47439) Document Python Data Source API in API reference page
[ https://issues.apache.org/jira/browse/SPARK-47439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-47439: - Summary: Document Python Data Source API in API reference page (was: Document Python Data Source API) > Document Python Data Source API in API reference page > - > > Key: SPARK-47439 > URL: https://issues.apache.org/jira/browse/SPARK-47439 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47439) Document Python Data Source API
Hyukjin Kwon created SPARK-47439: Summary: Document Python Data Source API Key: SPARK-47439 URL: https://issues.apache.org/jira/browse/SPARK-47439 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org