[jira] [Updated] (SPARK-47248) contains (non-binary collations)

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47248:
---
Labels: pull-request-available  (was: )

> contains (non-binary collations)
> 
>
> Key: SPARK-47248
> URL: https://issues.apache.org/jira/browse/SPARK-47248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Implemented efficient collation-aware in-place substring comparison to enable 
> collation support for {_}contains{_}. Spark SQL users should now be able to 
> use COLLATE within arguments of the built-in string function CONTAINS in 
> Spark SQL queries (for non-binary collations).
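> A minimal usage sketch follows (hypothetical: the collation name UTF8_LCASE 
> and the expected result are assumptions, not taken from this ticket):
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> class ContainsCollationSketch {
>     public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder().master("local").getOrCreate();
>         // Under a case-insensitive (non-binary) collation, CONTAINS should
>         // match regardless of case; expected result: true.
>         spark.sql("SELECT contains('Spark SQL' COLLATE UTF8_LCASE, 'spark')").show();
>     }
> }
> {code}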






[jira] [Assigned] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-47146:
---

Assignee: JacobZheng

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot 
> of threads, leaving no threads available on the server. Querying thread 
> details via jstack shows tons of threads named read-ahead. Checking the code 
> confirms that these threads are created by ReadAheadInputStream, which 
> creates a single-threaded thread pool on initialization:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
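> Each instance therefore owns a daemon thread that lives until the pool is 
> shut down. A minimal sketch of the usage pattern that guarantees shutdown 
> (hypothetical stream and buffer size; the two-argument constructor is an 
> assumption about the public API):
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
> import org.apache.spark.io.ReadAheadInputStream;
> 
> class ReadAheadUsageSketch {
>     // rawSpillStream is a hypothetical upstream source (e.g. a spill file).
>     static void readAll(InputStream rawSpillStream) throws IOException {
>         // try-with-resources always calls close(), which shuts down the
>         // single-threaded "read-ahead" pool; skipping close() leaks it.
>         try (InputStream in = new ReadAheadInputStream(rawSpillStream, 1 << 20)) {
>             byte[] buf = new byte[8192];
>             while (in.read(buf) != -1) {
>                 // consume as much of the stream as needed
>             }
>         }
>     }
> }
> {code}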
> In the current code, this thread pool is only shut down by 
> ReadAheadInputStream#close(). The call stack for the normal close() path is:
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed once 
> the data in the stream has been read all the way through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the iterator in order
>   // to avoid performance overhead. This check is added here in `loadNext()` instead of in
>   // `hasNext()` 

[jira] [Resolved] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-47146.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45327
[https://github.com/apache/spark/pull/45327]

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot 
> of threads, leaving no threads available on the server. Querying thread 
> details via jstack shows tons of threads named read-ahead. Checking the code 
> confirms that these threads are created by ReadAheadInputStream, which 
> creates a single-threaded thread pool on initialization:
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> In the current code, this thread pool is only shut down by 
> ReadAheadInputStream#close(). The call stack for the normal close() path is:
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed once 
> the data in the stream has been read all the way through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the 
> 

[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-04 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Description: 
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

After detailed analysis, we found that the driver submitted task 0.0 to 
executor 4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and 
executor 4 then sent the result to the driver. At the same time, however, the 
server running the driver did not have sufficient memory, so the driver was 
"unable to create new native thread" to handle the successful result of task 
0.0. The driver therefore thought task 0.0 had not finished and waited for 
the "missed result" forever.

 

driver submits task 0.0:

!driver_submit_task.png!

 

executor 4 finishes task 0.0:

!executor_4.png!

 

  was:
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful 

[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-04 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Description: 
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful result of task 0.0. The 
driver therefore thought task 0.0 had not finished and waited for the "missed 
result" forever.

 

driver submits task 0.0:

!driver_submit_task.png!

 

executor 4 finishes task 0.0:

!executor_4.png!

 

  was:
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful 

[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-04 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Attachment: executor_4.png

> spark driver process hangs due to "unable to create new native thread"
> --
>
> Key: SPARK-47279
> URL: https://issues.apache.org/jira/browse/SPARK-47279
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.1, 3.5.0
>Reporter: TianyiMa
>Priority: Major
> Attachments: driver_submit_task.png, executor_4.png
>
>
> We encountered a Spark driver that hung for about 11 hours and was finally 
> killed by the user. The driver log contains this error: 
> {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
> happened while processing message in the inbox for CoarseGrainedScheduler
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:719)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>         at 
> org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
>         at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
>  
> In detailed analysis, we found that the driver submitted task 0.0 to 
> executor 4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and 
> executor 4 then sent the result to the driver. At the same time, however, 
> the server running the driver did not have sufficient memory, so the driver 
> was "unable to create new native thread" to handle the successful result of 
> task 0.0. The driver therefore thought task 0.0 had not finished and waited 
> for the "missed result" forever.
>  
> driver submits task:
> !driver_submit_task.png!
>  
>  






[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-04 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Description: 
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful result of task 0.0. The 
driver therefore thought task 0.0 had not finished and waited for the "missed 
result" forever.

 

driver submits task:

!driver_submit_task.png!

 

 

  was:
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful result of task 0.0, and 
the driver therefore thought task 

[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-04 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Description: 
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful result of task 0.0. The 
driver therefore thought task 0.0 had not finished and waited for the "missed 
result" forever.

 

driver submits task:

 

 

 

  was:
We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful result of task 0.0, and 
the driver therefore thought task 0.0 had not finished 

[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-04 Thread TianyiMa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TianyiMa updated SPARK-47279:
-
Attachment: driver_submit_task.png

> spark driver process hangs due to "unable to create new native thread"
> --
>
> Key: SPARK-47279
> URL: https://issues.apache.org/jira/browse/SPARK-47279
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.1, 3.5.0
>Reporter: TianyiMa
>Priority: Major
> Attachments: driver_submit_task.png
>
>
> We encountered a Spark driver that hung for about 11 hours and was finally 
> killed by the user. The driver log contains this error: 
> {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
> happened while processing message in the inbox for CoarseGrainedScheduler
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:719)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>         at 
> org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
>         at 
> org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
>         at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
>         at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>         at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
>  
> In detailed analysis, we found that the driver submitted task 0.0 to 
> executor 4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and 
> executor 4 then sent the result to the driver. At the same time, however, 
> the server running the driver did not have sufficient memory, so the driver 
> was "unable to create new native thread" to handle the successful result of 
> task 0.0. The driver therefore thought task 0.0 had not finished and waited 
> for the "missed result" forever.
>  
> driver submits task:
>  
>  
>  






[jira] [Updated] (SPARK-47278) Upgrade rocksdbjni to 8.11.3

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47278:
---
Labels: pull-request-available  (was: )

> Upgrade rocksdbjni to 8.11.3
> 
>
> Key: SPARK-47278
> URL: https://issues.apache.org/jira/browse/SPARK-47278
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"

2024-03-04 Thread TianyiMa (Jira)
TianyiMa created SPARK-47279:


 Summary: spark driver process hangs due to "unable to create new 
native thread"
 Key: SPARK-47279
 URL: https://issues.apache.org/jira/browse/SPARK-47279
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 3.5.0, 3.1.1
Reporter: TianyiMa


We encountered a Spark driver that hung for about 11 hours and was finally 
killed by the user. The driver log contains this error: 
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error 
happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
        at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
        at 
org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
        at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
        at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
        at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{quote}
 

In detailed analysis, we found that the driver submitted task 0.0 to executor 
4 at "16:40:50", executor 4 finished task 0.0 at "16:42:39", and executor 4 
then sent the result to the driver. At the same time, however, the server 
running the driver did not have sufficient memory, so the driver was "unable 
to create new native thread" to handle the successful result of task 0.0. The 
driver therefore thought task 0.0 had not finished and waited for the "missed 
result" forever.
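
A minimal sketch of why the result is lost (hypothetical names, not Spark's 
actual classes): ThreadPoolExecutor.execute() can throw OutOfMemoryError from 
Thread.start() before the runnable is accepted, and in the trace above that 
error is only logged by the RPC inbox, so nothing retries the delivery.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LostResultSketch {
    // Stand-in for TaskResultGetter's real work of deserializing and
    // handling a successful task result.
    static void handleSuccessfulResult() { /* ... */ }

    static void onStatusUpdate() {
        ExecutorService getTaskResultPool = Executors.newFixedThreadPool(4);
        try {
            getTaskResultPool.execute(LostResultSketch::handleSuccessfulResult);
        } catch (OutOfMemoryError oom) {
            // Only logged in the reported scenario; the task is never
            // marked finished, so the driver waits for its result forever.
        }
    }
}
{code}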

 

!image-2024-03-05-11-12-00-227.png!

 






[jira] [Created] (SPARK-47278) Upgrade rocksdbjni to 8.11.3

2024-03-04 Thread Yang Jie (Jira)
Yang Jie created SPARK-47278:


 Summary: Upgrade rocksdbjni to 8.11.3
 Key: SPARK-47278
 URL: https://issues.apache.org/jira/browse/SPARK-47278
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie









[jira] [Resolved] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-03-04 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You resolved SPARK-47177.
---
Fix Version/s: 3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 45282
[https://github.com/apache/spark/pull/45282]

> Cached SQL plan do not display final AQE plan in explain string
> ---
>
> Key: SPARK-47177
> URL: https://issues.apache.org/jira/browse/SPARK-47177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Ziqi Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> An AQE plan is expected to display the final plan after execution. This is 
> not true for a cached SQL plan: it shows the initial plan instead. This 
> behavior change was introduced in 
> [https://github.com/apache/spark/pull/40812], which tried to fix a 
> concurrency issue with cached plans. 
> *In short, the plan used to execute and the plan used to explain are not the 
> same instance, which causes the inconsistency.*
>  
> I don't have a clear idea how to fix this yet:
>  * maybe we just use a coarse-granularity lock in explain?
>  * make innerChildren a function: clone the initial plan, and on every call 
> check whether the original AQE plan is finalized (making the final flag 
> atomic first, of course); if not, return the cloned initial plan; if it is 
> finalized, clone the final plan and return that one. This still won't 
> reflect the AQE plan in real time in a concurrent situation, but at least we 
> would have an initial version and a final version (see the sketch after the 
> repro below).
>  
> A simple repro:
> {code:java}
> d1 = spark.range(1000).withColumn("key", expr("id % 
> 100")).groupBy("key").agg({"key": "count"})
> cached_d2 = d1.cache()
> df = cached_d2.filter("key > 10")
> df.collect() {code}
> {code:java}
> >>> df.explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=true
> +- == Final Plan ==
>    *(1) Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- TableCacheQueryStage 0
>       +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), 
> (key#4L > 10)]
>             +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                   +- AdaptiveSparkPlan isFinalPlan=false
>                      +- HashAggregate(keys=[key#4L], 
> functions=[count(key#4L)])
>                         +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                            +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                               +- Project [(id#2L % 100) AS key#4L]
>                                  +- Range (0, 1000, step=1, splits=10)
> +- == Initial Plan ==
>    Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L 
> > 10)]
>          +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                +- AdaptiveSparkPlan isFinalPlan=false
>                   +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
>                      +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                         +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                            +- Project [(id#2L % 100) AS key#4L]
>                               +- Range (0, 1000, step=1, splits=10){code}
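> A self-contained toy sketch of the second idea above (hypothetical names, 
> not Spark's actual classes): publish an atomic "finalized" flag only after 
> the final plan is set, and let explain read whichever version is current.
> {code:java}
> import java.util.concurrent.atomic.AtomicBoolean;
> 
> class AdaptivePlanHolder {
>     private final AtomicBoolean finalized = new AtomicBoolean(false);
>     private volatile Object initialPlan;
>     private volatile Object finalPlan;
> 
>     void finalizePlan(Object plan) {
>         finalPlan = plan;
>         finalized.set(true);  // publish only after finalPlan is set
>     }
> 
>     Object planForExplain() {
>         // Readers see either the initial plan or the fully published
>         // final plan, never a half-updated state.
>         return finalized.get() ? finalPlan : initialPlan;
>     }
> }
> {code}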






[jira] [Assigned] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string

2024-03-04 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You reassigned SPARK-47177:
-

Assignee: XiDuo You

> Cached SQL plan do not display final AQE plan in explain string
> ---
>
> Key: SPARK-47177
> URL: https://issues.apache.org/jira/browse/SPARK-47177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2
>Reporter: Ziqi Liu
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>
> An AQE plan is expected to display the final plan after execution. This is 
> not true for a cached SQL plan: it shows the initial plan instead. This 
> behavior change was introduced in 
> [https://github.com/apache/spark/pull/40812], which tried to fix a 
> concurrency issue with cached plans. 
> *In short, the plan used to execute and the plan used to explain are not the 
> same instance, which causes the inconsistency.*
>  
> I don't have a clear idea how to fix this yet:
>  * maybe we just use a coarse-granularity lock in explain?
>  * make innerChildren a function: clone the initial plan, and on every call 
> check whether the original AQE plan is finalized (making the final flag 
> atomic first, of course); if not, return the cloned initial plan; if it is 
> finalized, clone the final plan and return that one. This still won't 
> reflect the AQE plan in real time in a concurrent situation, but at least we 
> would have an initial version and a final version.
>  
> A simple repro:
> {code:java}
> d1 = spark.range(1000).withColumn("key", expr("id % 
> 100")).groupBy("key").agg({"key": "count"})
> cached_d2 = d1.cache()
> df = cached_d2.filter("key > 10")
> df.collect() {code}
> {code:java}
> >>> df.explain()
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=true
> +- == Final Plan ==
>    *(1) Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- TableCacheQueryStage 0
>       +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), 
> (key#4L > 10)]
>             +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                   +- AdaptiveSparkPlan isFinalPlan=false
>                      +- HashAggregate(keys=[key#4L], 
> functions=[count(key#4L)])
>                         +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                            +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                               +- Project [(id#2L % 100) AS key#4L]
>                                  +- Range (0, 1000, step=1, splits=10)
> +- == Initial Plan ==
>    Filter (isnotnull(key#4L) AND (key#4L > 10))
>    +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L 
> > 10)]
>          +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>                +- AdaptiveSparkPlan isFinalPlan=false
>                   +- HashAggregate(keys=[key#4L], functions=[count(key#4L)])
>                      +- Exchange hashpartitioning(key#4L, 200), 
> ENSURE_REQUIREMENTS, [plan_id=24]
>                         +- HashAggregate(keys=[key#4L], 
> functions=[partial_count(key#4L)])
>                            +- Project [(id#2L % 100) AS key#4L]
>                               +- Range (0, 1000, step=1, splits=10){code}






[jira] [Updated] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47277:
---
Labels: pull-request-available  (was: )

> PySpark util function assertDataFrameEqual should not support streaming DF
> --
>
> Key: SPARK-47277
> URL: https://issues.apache.org/jira/browse/SPARK-47277
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark, SQL, Structured Streaming
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF

2024-03-04 Thread Wei Liu (Jira)
Wei Liu created SPARK-47277:
---

 Summary: PySpark util function assertDataFrameEqual should not 
support streaming DF
 Key: SPARK-47277
 URL: https://issues.apache.org/jira/browse/SPARK-47277
 Project: Spark
  Issue Type: New Feature
  Components: Connect, PySpark, SQL, Structured Streaming
Affects Versions: 3.5.1, 3.5.0, 4.0.0
Reporter: Wei Liu









[jira] [Updated] (SPARK-47272) MapState Implementation for State V2

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47272:
---
Labels: pull-request-available  (was: )

> MapState Implementation for State V2
> 
>
> Key: SPARK-47272
> URL: https://issues.apache.org/jira/browse/SPARK-47272
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Priority: Major
>  Labels: pull-request-available
>
> This task adds changes for the MapState implementation in State API v2. The 
> implementation adds a new encoder/decoder that encodes the grouping key and 
> the user key into a composite key stored in RocksDB, so that we can retrieve 
> the key-value pair for a user-specified user key with a single RocksDB get, 
> as sketched below.
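> A minimal sketch of the composite-key idea (hypothetical: the length-prefix 
> layout is an illustration, not the actual on-disk encoding):
> {code:java}
> import java.nio.ByteBuffer;
> 
> class CompositeKeySketch {
>     // Length-prefix the grouping key so (groupingKey, userKey) maps to one
>     // unambiguous RocksDB key; a point lookup then needs exactly one get().
>     static byte[] compositeKey(byte[] groupingKey, byte[] userKey) {
>         ByteBuffer buf = ByteBuffer.allocate(4 + groupingKey.length + userKey.length);
>         buf.putInt(groupingKey.length).put(groupingKey).put(userKey);
>         return buf.array();
>     }
> }
> {code}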






[jira] [Created] (SPARK-47276) Introduce `spark.profile.clear` for SparkSession-based profiling

2024-03-04 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-47276:


 Summary: Introduce `spark.profile.clear` for SparkSession-based 
profiling
 Key: SPARK-47276
 URL: https://issues.apache.org/jira/browse/SPARK-47276
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Xinrong Meng


Introduce `spark.profile.clear` for SparkSession-based profiling






[jira] [Created] (SPARK-47275) XML: Change to not support DROPMALFORMED parse mode

2024-03-04 Thread Yousof Hosny (Jira)
Yousof Hosny created SPARK-47275:


 Summary: XML: Change to not support DROPMALFORMED parse mode
 Key: SPARK-47275
 URL: https://issues.apache.org/jira/browse/SPARK-47275
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yousof Hosny


Change XML expressions to not support the DROPMALFORMED parse mode. This 
matches JSON expressions, which also do not support it. 






[jira] [Resolved] (SPARK-47252) Clarify that pivot may trigger an eager computation

2024-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47252.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45363
[https://github.com/apache/spark/pull/45363]

> Clarify that pivot may trigger an eager computation
> ---
>
> Key: SPARK-47252
> URL: https://issues.apache.org/jira/browse/SPARK-47252
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47252) Clarify that pivot may trigger an eager computation

2024-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47252:


Assignee: Nicholas Chammas

> Clarify that pivot may trigger an eager computation
> ---
>
> Key: SPARK-47252
> URL: https://issues.apache.org/jira/browse/SPARK-47252
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47274) Provide more useful context for PySpark DataFrame API errors

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47274:
---
Labels: pull-request-available  (was: )

> Provide more useful context for PySpark DataFrame API errors
> 
>
> Key: SPARK-47274
> URL: https://issues.apache.org/jira/browse/SPARK-47274
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Errors originating from PySpark operations can be difficult to debug because 
> the error messages carry limited context. While improvements on the JVM side 
> have been made to offer detailed error contexts, PySpark errors often lack 
> this level of detail. Adding detailed context about the location within the 
> user's PySpark code where the error occurred will improve debuggability for 
> PySpark users.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47274) Provide more useful context for PySpark DataFrame API errors

2024-03-04 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-47274:
---

 Summary: Provide more useful context for PySpark DataFrame API 
errors
 Key: SPARK-47274
 URL: https://issues.apache.org/jira/browse/SPARK-47274
 Project: Spark
  Issue Type: Bug
  Components: Connect, PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee


Errors originating from PySpark operations can be difficult to debug because the 
error messages carry limited context. While improvements on the JVM side have 
been made to offer detailed error contexts, PySpark errors often lack this 
level of detail. Adding detailed context about the location within the user's 
PySpark code where the error occurred will improve debuggability for PySpark users.
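
A small illustration of the gap (the proposed output format below is an 
assumption, not the implemented behavior):

{code:python}
from pyspark.sql.functions import col

# assumes an active SparkSession `spark`
df = spark.range(3)
df.select(col("id") + col("missing")).show()
# Today: an UNRESOLVED_COLUMN error that does not point at the user's code.
# Proposed: the same error enriched with the Python call site, for example
# "... in `select` at example.py:6" (format assumed), similar to the
# DataFrame query context already available on the JVM side.
{code}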

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47273) Implement python stream writer interface

2024-03-04 Thread Chaoqin Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaoqin Li updated SPARK-47273:
---
Description: 
In order to support developing Spark streaming sinks in Python, we need to 
implement a Python stream writer interface.

Reuse PythonPartitionWriter to implement the serialization and execution of the 
write callback on executors.

Implement a Python worker process to run the Python streaming data sink 
committer and communicate with the JVM through a socket in the Spark driver. For 
each Python streaming data sink instance, a long-lived Python worker process is 
created. Inside the Python process, the Python write committer receives commit 
or abort calls and sends the result back through the socket.

  was:
In order to support developing Spark streaming sinks in Python, we need to 
implement 

Reuse PythonPartitionWriter to implement the serialization and execution of the 
write callback on executors.

Implement a Python worker process to run the Python streaming data sink 
committer and communicate with the JVM through a socket in the Spark driver. For 
each Python streaming data sink instance, a long-lived Python worker process is 
created. Inside the Python process, the Python write committer receives commit 
or abort calls and sends the result back through the socket.


> Implement python stream writer interface
> 
>
> Key: SPARK-47273
> URL: https://issues.apache.org/jira/browse/SPARK-47273
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>
> In order to support developing Spark streaming sinks in Python, we need to 
> implement a Python stream writer interface.
> Reuse PythonPartitionWriter to implement the serialization and execution of 
> the write callback on executors.
> Implement a Python worker process to run the Python streaming data sink 
> committer and communicate with the JVM through a socket in the Spark driver. 
> For each Python streaming data sink instance, a long-lived Python worker 
> process is created. Inside the Python process, the Python write committer 
> receives commit or abort calls and sends the result back through the socket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47273) Implement python stream writer interface

2024-03-04 Thread Chaoqin Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaoqin Li updated SPARK-47273:
---
Description: 
In order to support developing Spark streaming sinks in Python, we need to 
implement 

Reuse PythonPartitionWriter to implement the serialization and execution of the 
write callback on executors.

Implement a Python worker process to run the Python streaming data sink 
committer and communicate with the JVM through a socket in the Spark driver. For 
each Python streaming data sink instance, a long-lived Python worker process is 
created. Inside the Python process, the Python write committer receives commit 
or abort calls and sends the result back through the socket.

  was:
In order to support developing Spark streaming sinks in Python, we need to 
implement

Reuse PythonPartitionWriter to implement the serialization and execution of the 
write callback on executors.

Implement a Python worker process to run the Python streaming data sink 
committer and communicate with the JVM through a socket in the Spark driver. For 
each Python streaming data sink instance, a long-lived Python worker process is 
created. Inside the Python process, the Python write committer receives commit 
or abort calls and sends the result back through the socket.


> Implement python stream writer interface
> 
>
> Key: SPARK-47273
> URL: https://issues.apache.org/jira/browse/SPARK-47273
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>
> In order to support developing Spark streaming sinks in Python, we need to 
> implement 
> Reuse PythonPartitionWriter to implement the serialization and execution of 
> the write callback on executors.
> Implement a Python worker process to run the Python streaming data sink 
> committer and communicate with the JVM through a socket in the Spark driver. 
> For each Python streaming data sink instance, a long-lived Python worker 
> process is created. Inside the Python process, the Python write committer 
> receives commit or abort calls and sends the result back through the socket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47273) Implement python stream writer interface

2024-03-04 Thread Chaoqin Li (Jira)
Chaoqin Li created SPARK-47273:
--

 Summary: Implement python stream writer interface
 Key: SPARK-47273
 URL: https://issues.apache.org/jira/browse/SPARK-47273
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SS
Affects Versions: 4.0.0
Reporter: Chaoqin Li


In order to support developing Spark streaming sinks in Python, we need to 
implement

Reuse PythonPartitionWriter to implement the serialization and execution of the 
write callback on executors.

Implement a Python worker process to run the Python streaming data sink 
committer and communicate with the JVM through a socket in the Spark driver. For 
each Python streaming data sink instance, a long-lived Python worker process is 
created. Inside the Python process, the Python write committer receives commit 
or abort calls and sends the result back through the socket.
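
A rough sketch of the user-facing shape described above (the class and method 
names here are hypothetical; the real API is defined by the implementation PR):

{code:python}
from abc import ABC, abstractmethod

class PythonStreamWriter(ABC):
    """Hypothetical stream writer: write() runs on executors via the
    reused PythonPartitionWriter path; commit()/abort() run in the
    long-lived driver-side Python worker process over the socket."""

    @abstractmethod
    def write(self, iterator):
        """Write one partition of a microbatch; return a commit message."""

    def commit(self, messages, batch_id):
        """Called once all partitions of the batch succeed."""

    def abort(self, messages, batch_id):
        """Called when any partition of the batch fails."""
{code}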



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36691) PythonRunner failed should pass error message to ApplicationMaster too

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-36691:
---
Labels: pull-request-available  (was: )

> PythonRunner failed should pass error message to ApplicationMaster too
> --
>
> Key: SPARK-36691
> URL: https://issues.apache.org/jira/browse/SPARK-36691
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>  Labels: pull-request-available
>
> In current pyspark, stderr and stdout are print together, if py-script 
> failed, will only throw a SparkUserAppsException with exit code 1. Then pass 
> this error to AM.
> In cluster mode, client side only show 
> {code}
> User application exited with 1.
> {code}
> Without correct error message.  Then user need to find why their job failed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46778) get_json_object flattens wildcard queries that match a single value

2024-03-04 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823371#comment-17823371
 ] 

Pablo Langa Blanco commented on SPARK-46778:


I was looking at it, and I found a comment in the code that explains this 
behavior 
([https://github.com/apache/spark/blob/35bced42474e3221cf61d13a142c3c5470df1f22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L377])
 

There are tests around that code which exercise it, and I reproduced the same 
behavior in Hive 3.1.3, which still works this way, so I don't know if we can 
change it.

> get_json_object flattens wildcard queries that match a single value
> ---
>
> Key: SPARK-46778
> URL: https://issues.apache.org/jira/browse/SPARK-46778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Robert Joseph Evans
>Priority: Major
>
> I think this impacts all versions of {{{}get_json_object{}}}, but I am not 
> 100% sure.
> The unit test for 
> [$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146]
>  verifies that the output of this query is a single level JSON array, but 
> when I put the same JSON and JSON path into [http://jsonpath.com/] I get a 
> result with multiple levels of nesting. It looks like Apache Spark tries to 
> flatten lists for {{[*]}} matches when there is only a single element that 
> matches.
> {code:java}
> scala> 
> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
> +--------------------------------+
> |get_json_object(jsonStr, $[*].a)|
> +--------------------------------+
> |"A"                             |
> |["A","B"]                       |
> +--------------------------------+ {code}
> But this is problematic in that I no longer have a consistent schema returned, 
> even if the input schema is known to be consistent. For example, if I wanted 
> to know how many elements matched this query, I could wrap it in 
> {{json_array_length}}, but that does not work in the generic case.
> {code:java}
> scala> 
> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |null                                               |
> |2                                                  |
> +---------------------------------------------------+ {code}
> If the values returned are themselves JSON arrays, then I do get a number 
> back, but it is the wrong one.
> {code:java}
> scala> 
> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |5                                                  |
> |2                                                  |
> +---------------------------------------------------+ {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2024-03-04 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823344#comment-17823344
 ] 

Asif commented on SPARK-33152:
--

[~tedjenks] The issue has always been there because of the way the constraint 
propagation rule works (due to its permutational logic). A possible reason it 
has become more common is changes that fixed previously undetected constraints: 
the more robust the code becomes at detecting constraints, the more the overall 
cost of the constraints code is likely to increase.

For example, if we have a projection with multiple aliases, and these aliases 
are used in CASE WHEN expressions that involve functions taking these aliases, 
the number of constraints created would be enormous, and even then the code (at 
least in 3.2) would not be able to cover all the possible constraints.

So my guess is that any change that increases the sensitivity of constraint 
identification will increase the cost of evaluating constraints.
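
A schematic illustration of the alias blowup described above (not taken from 
the Spark test suite; assumes an active SparkSession `spark`):

{code:python}
df = (spark.range(10).toDF("a")
      .selectExpr("a", "a AS a1", "a AS a2", "a AS a3",
                  "CASE WHEN abs(a) > 0 THEN a1 * 2 ELSE a2 * 3 END AS c")
      .filter("c > 10"))
# The constraint `c > 10` can be re-expressed over every combination of
# {a, a1, a2, a3} occurring in the CASE expression, so the number of
# candidate constraints grows combinatorially with the number of aliases
# and repetitions of the attribute.
{code}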

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing a new algorithm to create, store, and use constraints for removing 
> redundant filters and inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters 
> in different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, and does not push compound predicates into joins.
>  # If not fixed, this issue can cause OutOfMemory errors or unacceptable 
> query compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning, as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, 
> where the current code is not able to identify the redundant filter in some 
> cases.
>  # It is able to generate a better-optimized plan for join queries, as it can 
> push compound predicates.
>  # The current logic can miss many possible cases of removing redundant 
> predicates, as it fails to take into account whether the same attribute or 
> its aliases are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In 
> some cases the rule works only by virtue of previous rules covering for its 
> inaccuracy. That the ConstraintPropagation rule, and its function of removing 
> redundant filters and adding newly inferred filters, depends on the behavior 
> of other, unrelated, earlier optimizer rules is indicative of issues.
>  # It does away with all the EqualNullSafe constraints, as this logic does 
> not need those constraints to be created.
>  # There is at least one test in the existing ConstraintPropagationSuite 
> which is missing an IsNotNull constraint because the existing Constraints 
> code incorrectly generated an EqualNullSafe constraint instead of an EqualTo 
> constraint. With these changes, the test correctly creates an EqualTo 
> constraint, resulting in an inferred IsNotNull constraint.
>  # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The 
> number of constraints stored is exactly the same as the number of filters 
> encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can 
> benefit run-time characteristics too, like inferring an IsNotNull filter or 
> pushing down compound predicates on a join, which the present code may miss 
> or does not do, respectively.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate all 
> the possible combinations of constraints, based on 

[jira] [Updated] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44746:
---
Labels: pull-request-available  (was: )

> Improve the documentation for TABLE input arguments for UDTFs
> -
>
> Key: SPARK-44746
> URL: https://issues.apache.org/jira/browse/SPARK-44746
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> We should add more examples for using Python UDTFs with TABLE arguments.
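> A candidate example of the kind the documentation could include (Spark 3.5+ 
> Python UDTF with a TABLE argument; the UDTF here is illustrative):
> {code:python}
> from pyspark.sql import Row
> from pyspark.sql.functions import udtf
>
> @udtf(returnType="id: bigint, doubled: bigint")
> class DoubleIt:
>     def eval(self, row: Row):
>         # With a TABLE argument, eval is called once per input row.
>         yield row["id"], row["id"] * 2
>
> spark.udtf.register("double_it", DoubleIt)
> spark.sql("SELECT * FROM double_it(TABLE(SELECT id FROM range(3)))").show()
> {code}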



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47032) Create API for 'analyze' method to send input column(s) to output table unchanged

2024-03-04 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel resolved SPARK-47032.

Resolution: Won't Fix

After investigating this, we found that any performance benefits from skipping 
sending column values from the JVM to the Python worker were obviated by extra 
manipulation of values on the JVM side to forward them to the output table. As 
a result, we are closing this.

> Create API for 'analyze' method to send input column(s) to output table 
> unchanged
> -
>
> Key: SPARK-47032
> URL: https://issues.apache.org/jira/browse/SPARK-47032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Daniel
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47272) MapState Implementation for State V2

2024-03-04 Thread Jing Zhan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhan updated SPARK-47272:
--
  Epic Link: SPARK-46815
  Description: This task adds the MapState implementation in State API v2. The 
implementation adds a new encoder/decoder to encode the grouping key and the 
user key into a composite key stored in RocksDB, so that we can retrieve a 
key-value pair for a user-specified key with a single RocksDB get.
 Issue Type: Task  (was: New Feature)

> MapState Implementation for State V2
> 
>
> Key: SPARK-47272
> URL: https://issues.apache.org/jira/browse/SPARK-47272
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Priority: Major
>
> This task adds the MapState implementation in State API v2. The 
> implementation adds a new encoder/decoder to encode the grouping key and the 
> user key into a composite key stored in RocksDB, so that we can retrieve a 
> key-value pair for a user-specified key with a single RocksDB get.
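> A schematic of the composite-key layout described above (the byte layout is 
> illustrative, not the actual encoder):
> {code:python}
> import struct
>
> def composite_key(grouping_key: bytes, user_key: bytes) -> bytes:
>     # [len(grouping_key)][grouping_key][user_key]: length-prefixing keeps
>     # the grouping key self-delimiting, so a single RocksDB get on the
>     # composite key returns the value for exactly that (group, user key).
>     return struct.pack(">I", len(grouping_key)) + grouping_key + user_key
> {code}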



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47242) Bump ap-loader 3.0(v8) to support for async-profiler 3.0

2024-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47242:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Bump ap-loader 3.0(v8) to support for async-profiler 3.0
> 
>
> Key: SPARK-47242
> URL: https://issues.apache.org/jira/browse/SPARK-47242
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> ap-loader 3.0 (v8) has already been released, which adds support for 
> async-profiler 3.0. The release guide refers to [Loader for 3.0 (v8): Binary 
> launcher and AsyncGetCallTrace 
> replacement|https://github.com/jvm-profiling-tools/ap-loader/releases/tag/3.0-8].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47242) Bump ap-loader 3.0(v8) to support for async-profiler 3.0

2024-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47242.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45351
[https://github.com/apache/spark/pull/45351]

> Bump ap-loader 3.0(v8) to support for async-profiler 3.0
> 
>
> Key: SPARK-47242
> URL: https://issues.apache.org/jira/browse/SPARK-47242
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> ap-loader 3.0 (v8) has already been released, which adds support for 
> async-profiler 3.0. The release guide refers to [Loader for 3.0 (v8): Binary 
> launcher and AsyncGetCallTrace 
> replacement|https://github.com/jvm-profiling-tools/ap-loader/releases/tag/3.0-8].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47242) Bump ap-loader 3.0(v8) to support for async-profiler 3.0

2024-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47242:
-

Assignee: Nicholas Jiang

> Bump ap-loader 3.0(v8) to support for async-profiler 3.0
> 
>
> Key: SPARK-47242
> URL: https://issues.apache.org/jira/browse/SPARK-47242
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
>
> ap-loader 3.0 (v8) has already been released, which adds support for 
> async-profiler 3.0. The release guide refers to [Loader for 3.0 (v8): Binary 
> launcher and AsyncGetCallTrace 
> replacement|https://github.com/jvm-profiling-tools/ap-loader/releases/tag/3.0-8].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2024-03-04 Thread Ted Chester Jenks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823314#comment-17823314
 ] 

Ted Chester Jenks commented on SPARK-33152:
---

[~cloud_fan] [~ashahid7] We have seen a huge uptick in these issues since 
upgrading from 3.2 to 3.4. It is not exactly clear why they became so 
prevalent, but it is a big issue.

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing a new algorithm to create, store, and use constraints for removing 
> redundant filters and inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters 
> in different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, and does not push compound predicates into joins.
>  # If not fixed, this issue can cause OutOfMemory errors or unacceptable 
> query compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning, as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, 
> where the current code is not able to identify the redundant filter in some 
> cases.
>  # It is able to generate a better-optimized plan for join queries, as it can 
> push compound predicates.
>  # The current logic can miss many possible cases of removing redundant 
> predicates, as it fails to take into account whether the same attribute or 
> its aliases are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In 
> some cases the rule works only by virtue of previous rules covering for its 
> inaccuracy. That the ConstraintPropagation rule, and its function of removing 
> redundant filters and adding newly inferred filters, depends on the behavior 
> of other, unrelated, earlier optimizer rules is indicative of issues.
>  # It does away with all the EqualNullSafe constraints, as this logic does 
> not need those constraints to be created.
>  # There is at least one test in the existing ConstraintPropagationSuite 
> which is missing an IsNotNull constraint because the existing Constraints 
> code incorrectly generated an EqualNullSafe constraint instead of an EqualTo 
> constraint. With these changes, the test correctly creates an EqualTo 
> constraint, resulting in an inferred IsNotNull constraint.
>  # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The 
> number of constraints stored is exactly the same as the number of filters 
> encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can 
> benefit run-time characteristics too, like inferring an IsNotNull filter or 
> pushing down compound predicates on a join, which the present code may miss 
> or does not do, respectively.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate all 
> the possible combinations of constraints, based on the aliases (even then it 
> may miss a lot of combinations if the expression is a complex expression 
> involving the same attribute repeated multiple times within the expression 
> and there are many aliases to that column). There are query plans in our 
> production environment which can result in the intermediate number of 
> constraints going into the hundreds of thousands, causing OOM or taking time 
> running into hours. There are also cases where it incorrectly generates an 
> EqualNullSafe constraint instead of an EqualTo constraint, thus missing a 
> possible IsNotNull constraint on a column. 
> Also, it only pushes a single-column predicate on the other side of the 

[jira] [Updated] (SPARK-47269) Upgrade jetty to 11.0.20

2024-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47269:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Upgrade jetty to 11.0.20
> 
>
> Key: SPARK-47269
> URL: https://issues.apache.org/jira/browse/SPARK-47269
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Fixes:
>  * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - 
> HTTP/2 connection not closed after idle timeout when TCP congested



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47269) Upgrade jetty to 11.0.20

2024-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47269:
-

Assignee: Yang Jie

> Upgrade jetty to 11.0.20
> 
>
> Key: SPARK-47269
> URL: https://issues.apache.org/jira/browse/SPARK-47269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> Fixes:
>  * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - 
> HTTP/2 connection not closed after idle timeout when TCP congested



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47269) Upgrade jetty to 11.0.20

2024-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47269.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45371
[https://github.com/apache/spark/pull/45371]

> Upgrade jetty to 11.0.20
> 
>
> Key: SPARK-47269
> URL: https://issues.apache.org/jira/browse/SPARK-47269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Fixes:
>  * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - 
> HTTP/2 connection not closed after idle timeout when TCP congested



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46961) Adding processorHandle as a Context Variable

2024-03-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-46961:


Assignee: Eric Marnadi

> Adding processorHandle as a Context Variable
> 
>
> Key: SPARK-46961
> URL: https://issues.apache.org/jira/browse/SPARK-46961
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
>
> Instead of passing the StatefulProcessorHandle to the user in `init`, embed 
> it as a context variable, ProcessorContext, that the user can fetch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46961) Adding processorHandle as a Context Variable

2024-03-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-46961.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45359
[https://github.com/apache/spark/pull/45359]

> Adding processorHandle as a Context Variable
> 
>
> Key: SPARK-46961
> URL: https://issues.apache.org/jira/browse/SPARK-46961
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Instead of passing the StatefulProcessorHandle to the user in `init`, embed 
> it as a context variable, ProcessorContext, that the user can fetch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47271) Explain importance of statistics on SQL performance tuning page

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47271:
---
Labels: pull-request-available  (was: )

> Explain importance of statistics on SQL performance tuning page
> ---
>
> Key: SPARK-47271
> URL: https://issues.apache.org/jira/browse/SPARK-47271
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47268) Repartition by column should respect collation

2024-03-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47268:
---

Assignee: Aleksandar Tomic

> Repartition by column should respect collation
> --
>
> Key: SPARK-47268
> URL: https://issues.apache.org/jira/browse/SPARK-47268
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47268) Repartition by column should respect collation

2024-03-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47268.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45370
[https://github.com/apache/spark/pull/45370]

> Repartition by column should respect collation
> --
>
> Key: SPARK-47268
> URL: https://issues.apache.org/jira/browse/SPARK-47268
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47271) Explain importance of statistics on SQL performance tuning page

2024-03-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-47271:


 Summary: Explain importance of statistics on SQL performance 
tuning page
 Key: SPARK-47271
 URL: https://issues.apache.org/jira/browse/SPARK-47271
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47070) Subquery rewrite inside an aggregation makes an aggregation invalid

2024-03-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47070.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

> Subquery rewrite inside an aggregation makes an aggregation invalid
> ---
>
> Key: SPARK-47070
> URL: https://issues.apache.org/jira/browse/SPARK-47070
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Anton Lykov
>Assignee: Anton Kirillov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When an in/exists-subquery appears inside an aggregate expression within a 
> top-level GROUP BY, it gets rewritten and a new `exists` variable is 
> introduced. However, this variable is incorrectly handled in aggregation. For 
> example, consider the following query:
> ```
> SELECT
> CASE
> WHEN t1.id IN (SELECT id FROM t2) THEN 10
> ELSE -10
> END AS v1
> FROM t1
> GROUP BY t1.id;
> ```
>  
> Executing it leads to the following error:
> ```
> java.lang.IllegalArgumentException: Cannot find column index for attribute 
> 'exists#844' in: Map()
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47131) contains, startswith, endswith

2024-03-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47131.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45216
[https://github.com/apache/spark/pull/45216]

> contains, startswith, endswith
> --
>
> Key: SPARK-47131
> URL: https://issues.apache.org/jira/browse/SPARK-47131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Refactored built-in string functions to enable collation support for: 
> {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now 
> be able to use COLLATE within arguments for built-in string functions: 
> CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS 
> implementation for non-binary collations is a separate subtask (SPARK-47248).
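> For example (the collation name follows the naming in use at the time of 
> writing and may change):
> {code:python}
> # assumes an active SparkSession `spark`
> spark.sql("SELECT contains('Hello' COLLATE UTF8_BINARY_LCASE, 'hELLO')").show()
> # true under the case-insensitive collation; under the default UTF8_BINARY
> # collation the same call returns false
> {code}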



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47131) contains, startswith, endswith

2024-03-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47131:
---

Assignee: Uroš Bojanić

> contains, startswith, endswith
> --
>
> Key: SPARK-47131
> URL: https://issues.apache.org/jira/browse/SPARK-47131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Refactored built-in string functions to enable collation support for: 
> {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now 
> be able to use COLLATE within arguments for built-in string functions: 
> CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS 
> implementation for non-binary collations is a separate subtask (SPARK-47248).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47270) Dataset.isEmpty should not trigger job execution on CommandResults

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47270:
---
Labels: pull-request-available  (was: )

> Dataset.isEmpty should not trigger job execution on CommandResults
> --
>
> Key: SPARK-47270
> URL: https://issues.apache.org/jira/browse/SPARK-47270
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Zhen Wang
>Priority: Major
>  Labels: pull-request-available
>
> Similar to [SPARK-43124|https://issues.apache.org/jira/browse/SPARK-43124], 
> Dataset.isEmpty should also not trigger job execution on CommandResults.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47270) Dataset.isEmpty should not trigger job execution on CommandResults

2024-03-04 Thread Zhen Wang (Jira)
Zhen Wang created SPARK-47270:
-

 Summary: Dataset.isEmpty should not trigger job execution on 
CommandResults
 Key: SPARK-47270
 URL: https://issues.apache.org/jira/browse/SPARK-47270
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Zhen Wang


Similar to [SPARK-43124|https://issues.apache.org/jira/browse/SPARK-43124], 
Dataset.isEmpty should also not trigger job execution on CommandResults.
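
A minimal sketch of the behavior in question (PySpark syntax for brevity; the 
fix itself is on the JVM side, and the table DDL is illustrative):

{code:python}
# assumes an active SparkSession `spark`
res = spark.sql("CREATE TABLE IF NOT EXISTS t (id INT) USING parquet")
res.isEmpty()
# `res` wraps a locally available CommandResult, so this check should be
# answered without launching a Spark job, mirroring SPARK-43124.
{code}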



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47265) Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., columns: Array[Column], ...)` in UT

2024-03-04 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-47265:

Summary: Use `createTable(..., schema: StructType, ...)` instead of 
`createTable(..., columns: Array[Column], ...)` in UT  (was: Enable test of v2 
data sources in `FileBasedDataSourceSuite`)

> Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., 
> columns: Array[Column], ...)` in UT
> 
>
> Key: SPARK-47265
> URL: https://issues.apache.org/jira/browse/SPARK-47265
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47269) Upgrade jetty to 11.0.20

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47269:
---
Labels: pull-request-available  (was: )

> Upgrade jetty to 11.0.20
> 
>
> Key: SPARK-47269
> URL: https://issues.apache.org/jira/browse/SPARK-47269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> Fixes:
>  * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - 
> HTTP/2 connection not closed after idle timeout when TCP congested



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47269) Upgrade jetty to 11.0.20

2024-03-04 Thread Yang Jie (Jira)
Yang Jie created SPARK-47269:


 Summary: Upgrade jetty to 11.0.20
 Key: SPARK-47269
 URL: https://issues.apache.org/jira/browse/SPARK-47269
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie


Fixes:
 * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - HTTP/2 
connection not closed after idle timeout when TCP congested



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47268) Repartition by column should respect collation

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47268:
---
Labels: pull-request-available  (was: )

> Repartition by column should respect collation
> --
>
> Key: SPARK-47268
> URL: https://issues.apache.org/jira/browse/SPARK-47268
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43258) Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]

2024-03-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-43258.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45288
[https://github.com/apache/spark/pull/45288]

> Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]
> --
>
> Key: SPARK-43258
> URL: https://issues.apache.org/jira/browse/SPARK-43258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Assignee: Deng Ziming
>Priority: Minor
>  Labels: pull-request-available, starter
> Fix For: 4.0.0
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2023* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), 
> replace the error with an internal error; see 
> {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Propose to users a solution for avoiding and fixing such errors.
> Please, look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43258) Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]

2024-03-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-43258:


Assignee: Deng Ziming

> Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]
> --
>
> Key: SPARK-43258
> URL: https://issues.apache.org/jira/browse/SPARK-43258
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Assignee: Deng Ziming
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2023* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), 
> replace the error with an internal error; see 
> {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Propose to users a solution for avoiding and fixing such errors.
> Please, look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47268) Repartition by column should respect collation

2024-03-04 Thread Aleksandar Tomic (Jira)
Aleksandar Tomic created SPARK-47268:


 Summary: Repartition by column should respect collation
 Key: SPARK-47268
 URL: https://issues.apache.org/jira/browse/SPARK-47268
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Aleksandar Tomic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47267) Hash functions should respect collation

2024-03-04 Thread Aleksandar Tomic (Jira)
Aleksandar Tomic created SPARK-47267:


 Summary: Hash functions should respect collation
 Key: SPARK-47267
 URL: https://issues.apache.org/jira/browse/SPARK-47267
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Aleksandar Tomic


All functions in the `hash_funcs` group should respect collation.
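
For example (a behavior sketch; the collation name follows the naming in use at 
the time of writing and may change):

{code:python}
# assumes an active SparkSession `spark`
spark.sql("""
    SELECT hash('a' COLLATE UTF8_BINARY_LCASE),
           hash('A' COLLATE UTF8_BINARY_LCASE)
""").show()
# Under a case-insensitive collation, 'a' and 'A' compare as equal, so once
# hash functions respect collation the two hashes above should match.
{code}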



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47131) contains, startswith, endswith

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47131:
--

Assignee: (was: Apache Spark)

> contains, startswith, endswith
> --
>
> Key: SPARK-47131
> URL: https://issues.apache.org/jira/browse/SPARK-47131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Refactored built-in string functions to enable collation support for: 
> {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now 
> be able to use COLLATE within arguments for built-in string functions: 
> CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS 
> implementation for non-binary collations is a separate subtask (SPARK-47248).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47131) contains, startswith, endswith

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47131:
--

Assignee: Apache Spark

> contains, startswith, endswith
> --
>
> Key: SPARK-47131
> URL: https://issues.apache.org/jira/browse/SPARK-47131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Refactored built-in string functions to enable collation support for: 
> {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now 
> be able to use COLLATE within arguments for built-in string functions: 
> CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS 
> implementation for non-binary collations is a separate subtask (SPARK-47248).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47131) contains, startswith, endswith

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47131:
--

Assignee: (was: Apache Spark)

> contains, startswith, endswith
> --
>
> Key: SPARK-47131
> URL: https://issues.apache.org/jira/browse/SPARK-47131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> Refactored built-in string functions to enable collation support for: 
> {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now 
> be able to use COLLATE within arguments for built-in string functions: 
> CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS 
> implementation for non-binary collations is a separate subtask (SPARK-47248).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47131) contains, startswith, endswith

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47131:
--

Assignee: Apache Spark

> contains, startswith, endswith
> --
>
> Key: SPARK-47131
> URL: https://issues.apache.org/jira/browse/SPARK-47131
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Refactored built-in string functions to enable collation support for: 
> {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now 
> be able to use COLLATE within arguments for built-in string functions: 
> CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS 
> implementation for non-binary collations is a separate subtask (SPARK-47248).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47070) Subquery rewrite inside an aggregation makes an aggregation invalid

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47070:
--

Assignee: (was: Apache Spark)

> Subquery rewrite inside an aggregation makes an aggregation invalid
> ---
>
> Key: SPARK-47070
> URL: https://issues.apache.org/jira/browse/SPARK-47070
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Anton Lykov
>Priority: Major
>  Labels: pull-request-available
>
> When an in/exists-subquery appears inside an aggregate expression within a 
> top-level GROUP BY, it gets rewritten and a new `exists` variable is 
> introduced. However, this variable is incorrectly handled in aggregation. For 
> example, consider the following query:
> ```
> SELECT
>   CASE
>     WHEN t1.id IN (SELECT id FROM t2) THEN 10
>     ELSE -10
>   END AS v1
> FROM t1
> GROUP BY t1.id;
> ```
>  
> Executing it leads to the following error:
> ```
> java.lang.IllegalArgumentException: Cannot find column index for attribute 
> 'exists#844' in: Map()
> ```
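A self-contained reproduction sketch is below. The table names t1/t2 and the query come from the report; the SparkSession setup and the sample rows are assumptions added for illustration.

```
// Hypothetical reproduction sketch in Scala; assumes a local SparkSession.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("id").createOrReplaceTempView("t1")
Seq(2, 3).toDF("id").createOrReplaceTempView("t2")

// On affected versions (3.5.0), this is reported to fail with:
// java.lang.IllegalArgumentException: Cannot find column index for
// attribute 'exists#...' in: Map()
spark.sql("""
  SELECT CASE WHEN t1.id IN (SELECT id FROM t2) THEN 10 ELSE -10 END AS v1
  FROM t1
  GROUP BY t1.id
""").show()
```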



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47070) Subquery rewrite inside an aggregation makes the aggregation invalid

2024-03-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47070:
--

Assignee: Apache Spark

> Subquery rewrite inside an aggregation makes the aggregation invalid
> ---
>
> Key: SPARK-47070
> URL: https://issues.apache.org/jira/browse/SPARK-47070
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Anton Lykov
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> When an in/exists-subquery appears inside an aggregate expression within a 
> top-level GROUP BY, it gets rewritten and a new `exists` variable is 
> introduced. However, this variable is incorrectly handled in aggregation. For 
> example, consider the following query:
> ```
> SELECT
>   CASE
>     WHEN t1.id IN (SELECT id FROM t2) THEN 10
>     ELSE -10
>   END AS v1
> FROM t1
> GROUP BY t1.id;
> ```
>  
> Executing it leads to the following error:
> ```
> java.lang.IllegalArgumentException: Cannot find column index for attribute 
> 'exists#844' in: Map()
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47266) Make `ProtoUtils.abbreviate` return the same type as the input

2024-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47266.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45369
[https://github.com/apache/spark/pull/45369]

> Make `ProtoUtils.abbreviate` return the same type as the input
> --
>
> Key: SPARK-47266
> URL: https://issues.apache.org/jira/browse/SPARK-47266
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
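The ticket body is empty, but the title implies a generics change along these lines. Below is a minimal illustrative sketch, assuming the method abbreviates protobuf messages and that type preservation via a bounded type parameter is the goal; it is not the actual Spark Connect code.

```
import com.google.protobuf.Message

object ProtoUtilsSketch {
  // Sketch only: a signature like `abbreviate(msg: Message): Message` loses
  // the concrete input type; a bounded type parameter preserves it, so
  // abbreviating a concrete message subtype returns that same subtype.
  def abbreviate[T <: Message](msg: T, maxStringLen: Int): T = {
    // Real logic would rebuild `msg` with long string/bytes fields truncated;
    // toBuilder/build keeps the runtime type, which is cast back to T.
    msg.toBuilder.build().asInstanceOf[T]
  }
}
```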




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47253) Allow LiveEventBus to stop without completely draining the event queue

2024-03-04 Thread TakawaAkirayo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TakawaAkirayo updated SPARK-47253:
--
Affects Version/s: 4.0.0
   (was: 3.5.0)

> Allow LiveEventBus to stop without completely draining the event queue
> -
>
> Key: SPARK-47253
> URL: https://issues.apache.org/jira/browse/SPARK-47253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
>
> Problem statement:
> SparkContext.stop() can hang for a long time in LiveEventBus.stop() when
> listeners are slow.
> User scenarios:
> We have a centralized service with multiple instances that regularly
> executes users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process, with an
> account defined in the task.
> 2. Instantiate listeners by class name and register them with the
> SparkContext. The JARs containing the listener classes are uploaded to the
> service by the user.
> 3. Prepare resources.
> 4. Run user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
> In step 5, Spark waits for the LiveEventBus to stop, which requires the
> remaining events to be completely drained by each listener.
> Since listeners are implemented by users, we cannot prevent heavy work
> inside a listener on each event. A single heavy job can have over 30,000
> tasks, and it can take 30 minutes for a listener to process all the
> remaining events, because the listener acquires a coarse-grained global
> lock and updates its internal status in a remote database.
> This kind of delay affects other user tasks in the queue, so from the
> server-side perspective we need a guarantee that the stop operation
> finishes quickly.
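As a rough illustration of the requested behavior, here is a minimal sketch of an event bus whose stop() waits only up to a configurable timeout for the queue to drain before shutting down. The class and parameter names are invented for illustration and do not match Spark's actual LiveListenerBus internals.

```
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Minimal sketch of a bounded-wait stop; not Spark's actual API.
class BoundedStopEventBus[E](process: E => Unit, stopTimeoutMs: Long) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val worker = new Thread(() => {
    while (!stopped) {
      val event = queue.poll(100, TimeUnit.MILLISECONDS)
      if (event != null) process(event) // user listener work happens here
    }
  })
  worker.setDaemon(true)
  worker.start()

  def post(event: E): Unit = if (!stopped) queue.put(event)

  // Wait up to stopTimeoutMs for the queue to drain, then stop regardless,
  // dropping whatever events remain instead of blocking indefinitely.
  def stop(): Unit = {
    val deadline = System.currentTimeMillis() + stopTimeoutMs
    while (!queue.isEmpty && System.currentTimeMillis() < deadline) {
      Thread.sleep(10)
    }
    stopped = true
    worker.join(1000)
  }
}
```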



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47266) Make `ProtoUtils.abbreviate` return the same type as the input

2024-03-04 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-47266:
-

 Summary: Make `ProtoUtils.abbreviate` return the same type as the 
input
 Key: SPARK-47266
 URL: https://issues.apache.org/jira/browse/SPARK-47266
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org