[jira] [Updated] (SPARK-47248) contains (non-binary collations)
[ https://issues.apache.org/jira/browse/SPARK-47248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47248:
---
Labels: pull-request-available (was: )

> contains (non-binary collations)
>
>
> Key: SPARK-47248
> URL: https://issues.apache.org/jira/browse/SPARK-47248
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
> Labels: pull-request-available
>
> Implemented efficient collation-aware in-place substring comparison to enable
> collation support for: {_}contains{_}. Spark SQL users should now be able to
> use COLLATE within arguments for built-in string function: CONTAINS in Spark
> SQL queries (for non-binary collations).

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47146) Possible thread leak when doing sort merge join
[ https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan reassigned SPARK-47146:
---
Assignee: JacobZheng

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0, 3.3.0, 3.4.0
> Reporter: JacobZheng
> Assignee: JacobZheng
> Priority: Critical
> Labels: pull-request-available
>
> I have a long-running Spark job and found an executor holding so many
> threads that none were left available on the server. Thread details from
> jstack show a large number of threads named read-ahead. The code confirms
> that these threads are created by ReadAheadInputStream, which creates a
> single-threaded thread pool when it is initialized:
> {code:java}
> private final ExecutorService executorService =
>     ThreadUtils.newDaemonSingleThreadExecutor("read-ahead");
> {code}
> This thread pool is shut down by ReadAheadInputStream#close().
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 in stage 71.0 (TID 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829)
> {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed when the
> data in the stream is read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task in case it has been marked as killed. This logic is from
>   // InterruptibleIterator, but we inline it here instead of wrapping the iterator in order
>   // to avoid performance overhead. This check is added here in `loadNext()` instead of in
>   // `hasNext()`
[jira] [Resolved] (SPARK-47146) Possible thread leak when doing sort merge join
[ https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-47146.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 45327
[https://github.com/apache/spark/pull/45327]

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0, 3.3.0, 3.4.0
> Reporter: JacobZheng
> Assignee: JacobZheng
> Priority: Critical
> Labels: pull-request-available
> Fix For: 4.0.0
>
> I have a long-running Spark job and found an executor holding so many
> threads that none were left available on the server. Thread details from
> jstack show a large number of threads named read-ahead. The code confirms
> that these threads are created by ReadAheadInputStream, which creates a
> single-threaded thread pool when it is initialized:
> {code:java}
> private final ExecutorService executorService =
>     ThreadUtils.newDaemonSingleThreadExecutor("read-ahead");
> {code}
> This thread pool is shut down by ReadAheadInputStream#close().
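The leak pattern described in this report can be illustrated with a small, self-contained sketch. It is not Spark code and not the change made in pull request 45327: `PooledStream` is a hypothetical stand-in for a stream that owns a single-threaded pool, and the two helper methods are invented for illustration. The sketch assumes a consumer that stops reading before the end of the data, so a close that is tied to reaching the end never runs and the worker thread stays alive.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ReadAheadLeakSketch {

    /** Hypothetical stand-in for a stream that owns a single-threaded pool. */
    static final class PooledStream extends InputStream {
        private final InputStream in;
        private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "read-ahead-sketch");
            t.setDaemon(true); // daemon, so a leaked worker does not block JVM exit
            return t;
        });

        PooledStream(InputStream in) {
            this.in = in;
            executor.submit(() -> { }); // forces the worker thread to be created
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public void close() throws IOException {
            executor.shutdownNow(); // the only place the worker thread is released
            in.close();
        }

        boolean poolShutDown() {
            return executor.isShutdown();
        }
    }

    /** Reads one byte and closes only if the end of the data was reached. */
    static PooledStream readFirstByteCloseAtEof(byte[] data) throws IOException {
        PooledStream s = new PooledStream(new ByteArrayInputStream(data));
        if (s.read() == -1) {
            s.close();
        }
        return s; // the pool is still running if the data was not exhausted
    }

    /** Reads one byte and always closes, however much data is left. */
    static PooledStream readFirstByteAlwaysClose(byte[] data) throws IOException {
        PooledStream s = new PooledStream(new ByteArrayInputStream(data));
        try (PooledStream r = s) {
            r.read();
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3};
        System.out.println("close at EOF only, pool shut down: "
            + readFirstByteCloseAtEof(data).poolShutDown());
        System.out.println("always close, pool shut down: "
            + readFirstByteAlwaysClose(data).poolShutDown());
    }
}
```

In the first helper the pool is never shut down because the reader abandons the stream after one byte. In the second, try-with-resources shuts it down regardless of how much was read.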
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-47279: - Description: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} After detailed analysis, we found that, the driver submitted task 0.0 at 
"16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sent results to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful result of task 0.0, then the driver think task 0.0 has not finished and waiting for the "missed result" forever. driver submit task 0.0 !driver_submit_task.png! executor 4 task 0.0 !executor_4.png! was: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} In detailed analysis, we found that, the driver submit a task 0.0 at "16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sends result to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-47279: - Description: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} In detailed analysis, we found that, the driver submit a task 0.0 at 
"16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sends result to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful result of task 0.0, then the driver think task 0.0 has not finished and waiting for the "missed result" forever. driver submit task 0.0 !driver_submit_task.png! executor 4 task 0.0 !executor_4.png! was: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} In detailed analysis, we found that, the driver submit a task 0.0 at "16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sends result to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TianyiMa updated SPARK-47279:
-
Attachment: executor_4.png

> spark driver process hangs due to "unable to create new native thread"
> --
>
> Key: SPARK-47279
> URL: https://issues.apache.org/jira/browse/SPARK-47279
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 3.1.1, 3.5.0
> Reporter: TianyiMa
> Priority: Major
> Attachments: driver_submit_task.png, executor_4.png
>
>
> We found that the Spark driver hung for about 11 hours and was finally killed
> by the user. The driver log contains this error:
> {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:719)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>     at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
>     at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
>     at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
>     at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
>     at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>     at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>     at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
>
> Detailed analysis showed that the driver submitted task 0.0 to executor 4 at
> "16:40:50", and that executor 4 finished task 0.0 at "16:42:39" and sent the
> result to the driver. At that moment, however, the server running the driver
> did not have sufficient memory, so the driver was "unable to create new
> native thread" to handle the successful result of task 0.0. The driver
> therefore treated task 0.0 as unfinished and waited forever for the
> "missed result".
>
> driver submit task:
> !driver_submit_task.png!
>
>
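The failure mode reported here, a result hand-off that fails while the waiter is never told, can be sketched in isolation. This is a simplified illustration and not Spark's scheduler code: `FAILING_POOL`, `enqueueUnguarded`, `enqueueGuarded` and `timesOut` are hypothetical names, the `OutOfMemoryError` is simulated, and the "log and continue" behaviour is an assumption based on the `Inbox` error line in the quoted log.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class LostResultSketch {

    /** Hypothetical pool that simulates the JVM failing to start a worker thread. */
    static final Executor FAILING_POOL = task -> {
        throw new OutOfMemoryError("unable to create new native thread (simulated)");
    };

    /** Hands the result to the pool; a failed hand-off is only logged. */
    static CompletableFuture<String> enqueueUnguarded(Executor pool, String result) {
        CompletableFuture<String> done = new CompletableFuture<>();
        try {
            pool.execute(() -> done.complete(result));
        } catch (Throwable t) {
            // Assumed behaviour: the caller logs the error and keeps running,
            // so nothing ever completes the future.
            System.err.println("logged and ignored: " + t);
        }
        return done;
    }

    /** Same hand-off, but a failure is reported to whoever waits on the future. */
    static CompletableFuture<String> enqueueGuarded(Executor pool, String result) {
        CompletableFuture<String> done = new CompletableFuture<>();
        try {
            pool.execute(() -> done.complete(result));
        } catch (Throwable t) {
            done.completeExceptionally(t);
        }
        return done;
    }

    /** Returns true if the future is still incomplete after a short wait. */
    static boolean timesOut(CompletableFuture<String> f) throws InterruptedException {
        try {
            f.get(200, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (ExecutionException e) {
            return false; // completed, although with a failure
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unguarded waiter still waiting: "
            + timesOut(enqueueUnguarded(FAILING_POOL, "task 0.0 result")));
        System.out.println("guarded waiter still waiting: "
            + timesOut(enqueueGuarded(FAILING_POOL, "task 0.0 result")));
    }
}
```

In the unguarded variant the waiter would block indefinitely without the timeout, which matches the reported hang. The guarded variant surfaces the failure so the caller can retry or abort; whether that is the right remedy for Spark is left open here.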
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-47279: - Description: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} In detailed analysis, we found that, the driver submit a task 0.0 at 
"16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sends result to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful result of task 0.0, then the driver think task 0.0 has not finished and waiting for the "missed result" forever. driver submit task: !driver_submit_task.png! was: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} In detailed analysis, we found that, the driver submit a task 0.0 at "16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sends result to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful result of task 0.0, then the driver think task
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-47279: - Description: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} In detailed analysis, we found that, the driver submit a task 0.0 at 
"16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sends result to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful result of task 0.0, then the driver think task 0.0 has not finished and waiting for the "missed result" forever. driver submit task: was: we encounter that spark driver hangs for about 11 hours, and finall killed by user. In the driver log there is an error log: {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:719) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {quote} In detailed analysis, we found that, the driver submit a task 0.0 at "16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", then executor 4 sends result to the driver. But in the same time, there is not sufficient memory in the the server that running the driver, the driver "unable to create new native thread" to handle the successful result of task 0.0, then the driver think task 0.0 has not finished
[jira] [Updated] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
[ https://issues.apache.org/jira/browse/SPARK-47279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TianyiMa updated SPARK-47279: - Attachment: driver_submit_task.png > spark driver process hangs due to "unable to create new native thread" > -- > > Key: SPARK-47279 > URL: https://issues.apache.org/jira/browse/SPARK-47279 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 3.1.1, 3.5.0 >Reporter: TianyiMa >Priority: Major > Attachments: driver_submit_task.png > > > we encounter that spark driver hangs for about 11 hours, and finall killed > by user. In the driver log there is an error log: > {quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error > happened while processing message in the inbox for CoarseGrainedScheduler > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:719) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) > at > org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) > at > org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144) > at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {quote} > > In detailed analysis, we found that, the driver submit a task 0.0 at > "16:40:50" to executor 4, and executor 4 finished the task 0.0 at "16:42:39", > then executor 4 sends result to the driver. But in the same time, there is > not sufficient memory in the the server that running the driver, the driver > "unable to create new native thread" to handle the successful result of task > 0.0, then the driver think task 0.0 has not finished and waiting for the > "missed result" forever. > > driver submit task: > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47278) Upgrade rocksdbjni to 8.11.3
[ https://issues.apache.org/jira/browse/SPARK-47278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47278: --- Labels: pull-request-available (was: ) > Upgrade rocksdbjni to 8.11.3 > > > Key: SPARK-47278 > URL: https://issues.apache.org/jira/browse/SPARK-47278 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-47279) spark driver process hangs due to "unable to create new native thread"
TianyiMa created SPARK-47279:

Summary: spark driver process hangs due to "unable to create new native thread"
Key: SPARK-47279
URL: https://issues.apache.org/jira/browse/SPARK-47279
Project: Spark
Issue Type: Bug
Components: Scheduler, Spark Core
Affects Versions: 3.5.0, 3.1.1
Reporter: TianyiMa

We found that the Spark driver hung for about 11 hours and was finally killed by the user. The driver log contains this error:
{quote}16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:719)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
    at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
    at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
    at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{quote}
On detailed analysis, we found that the driver submitted task 0.0 to executor 4 at "16:40:50", and executor 4 finished task 0.0 at "16:42:39" and sent the result to the driver. At the same time, however, the server running the driver did not have sufficient memory, so the driver was "unable to create new native thread" to handle the successful result of task 0.0. The driver therefore considered task 0.0 unfinished and waited for the "missed result" forever.

!image-2024-03-05-11-12-00-227.png!
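The failure mode reported in SPARK-47279 can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not Spark's scheduler code: `ToyScheduler`, `FailingPool`, `InlinePool` and their methods are hypothetical names, and the model assumes (as the driver log suggests) that the error raised while enqueueing the result is logged and nothing retries the hand-off.

```python
class FailingPool:
    """Hypothetical stand-in for a pool that cannot start a new worker thread."""

    def submit(self, fn, *args):
        raise MemoryError("unable to create new native thread")


class InlinePool:
    """Hypothetical stand-in for a healthy pool; runs the work immediately."""

    def submit(self, fn, *args):
        fn(*args)


class ToyScheduler:
    """Toy model of a driver that tracks running tasks (not Spark's class)."""

    def __init__(self, pool):
        self.pool = pool
        self.running = {"0.0"}   # task 0.0 has been submitted to an executor
        self.errors = []

    def _handle_success(self, task_id):
        # Marks the task as finished once its result has been processed.
        self.running.discard(task_id)

    def status_update(self, task_id):
        # The executor reports success; the result is handed to a pool.
        try:
            self.pool.submit(self._handle_success, task_id)
        except MemoryError as exc:
            # Assumption: the message loop only logs the error and moves on.
            self.errors.append(str(exc))


healthy = ToyScheduler(InlinePool())
healthy.status_update("0.0")

stuck = ToyScheduler(FailingPool())
stuck.status_update("0.0")

print(healthy.running)  # no tasks left running
print(stuck.running)    # task 0.0 is still considered running
```

In this model the stuck scheduler never learns that task 0.0 finished, which mirrors the "waiting forever" symptom in the report; a real fix would need to fail the task or retry the hand-off rather than drop the status update.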
[jira] [Created] (SPARK-47278) Upgrade rocksdbjni to 8.11.3
Yang Jie created SPARK-47278: Summary: Upgrade rocksdbjni to 8.11.3 Key: SPARK-47278 URL: https://issues.apache.org/jira/browse/SPARK-47278 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie
[jira] [Resolved] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

XiDuo You resolved SPARK-47177.
---
Fix Version/s: 3.5.2 4.0.0
Resolution: Fixed

Issue resolved by pull request 45282 [https://github.com/apache/spark/pull/45282]

> Cached SQL plan do not display final AQE plan in explain string
> ---
>
> Key: SPARK-47177
> URL: https://issues.apache.org/jira/browse/SPARK-47177
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2
> Reporter: Ziqi Liu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.2, 4.0.0
>
>
> The AQE plan is expected to display the final plan after execution. This is not true for a cached SQL plan: it shows the initial plan instead. This behavior change was introduced in [https://github.com/apache/spark/pull/40812], which tried to fix a concurrency issue with cached plans.
> *In short, the plan used to execute and the plan used to explain are not the same instance, which causes the inconsistency.*
>
> I don't have a clear idea how to fix this yet:
> * maybe we just need a coarse-granularity lock in explain?
> * make innerChildren a function: clone the initial plan, and every time it is called check whether the original AQE plan is finalized (making the final flag atomic first, of course). If not, return the cloned initial plan; if it is finalized, clone the final plan and return that one. This still won't reflect the AQE plan in real time in a concurrent situation, but at least we have the initial version and the final version.
> > A simple repro: > {code:java} > d1 = spark.range(1000).withColumn("key", expr("id % > 100")).groupBy("key").agg({"key": "count"}) > cached_d2 = d1.cache() > df = cached_d2.filter("key > 10") > df.collect() {code} > {code:java} > >>> df.explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=true > +- == Final Plan == > *(1) Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- TableCacheQueryStage 0 > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), > (key#4L > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], > functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10) > +- == Initial Plan == > Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
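The root cause stated in SPARK-47177 (the plan that is executed and the plan that is explained are different instances) can be shown with a toy model. This is a hedged sketch, not Spark code: `ToyAdaptivePlan` and `ToyCachedRelation` are hypothetical classes, and which side holds the clone is an assumption made only for illustration.

```python
import copy


class ToyAdaptivePlan:
    """Hypothetical stand-in for an adaptive plan with a 'final' flag."""

    def __init__(self):
        self.is_final = False

    def execute(self):
        # Execution finalizes this particular instance only.
        self.is_final = True

    def explain(self):
        return f"AdaptiveSparkPlan isFinalPlan={str(self.is_final).lower()}"


class ToyCachedRelation:
    """Hypothetical cached relation that executes a clone of its plan."""

    def __init__(self, plan):
        self.plan_for_explain = plan
        # Assumption: a separate copy is executed to avoid concurrency issues.
        self.plan_for_execution = copy.deepcopy(plan)

    def materialize(self):
        self.plan_for_execution.execute()

    def explain(self):
        return self.plan_for_explain.explain()


relation = ToyCachedRelation(ToyAdaptivePlan())
relation.materialize()

print(relation.explain())                     # still reports the initial plan
print(relation.plan_for_execution.explain())  # the executed copy is final
```

The explain output stays at `isFinalPlan=false` even though the executed copy has finished, which is the same shape as the nested `AdaptiveSparkPlan isFinalPlan=false` under `InMemoryRelation` in the repro output.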
[jira] [Assigned] (SPARK-47177) Cached SQL plan do not display final AQE plan in explain string
[ https://issues.apache.org/jira/browse/SPARK-47177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You reassigned SPARK-47177: - Assignee: XiDuo You > Cached SQL plan do not display final AQE plan in explain string > --- > > Key: SPARK-47177 > URL: https://issues.apache.org/jira/browse/SPARK-47177 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.2, 3.5.0, 4.0.0, 3.5.1, 3.5.2 >Reporter: Ziqi Liu >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > > AQE plan is expected to display final plan after execution. This is not true > for cached SQL plan: it will show the initial plan instead. This behavior > change is introduced in [https://github.com/apache/spark/pull/40812] it tried > to fix the concurrency issue with cached plan. > *In short, the plan used to executed and the plan used to explain is not the > same instance, thus causing the inconsistency.* > > I don't have a clear idea how yet > * maybe we just a coarse granularity lock in explain? > * make innerChildren a function: clone the initial plan, every time checked > for whether the original AQE plan is finalized (making the final flag atomic > first, of course), if no return the cloned initial plan, if it's finalized, > clone the final plan and return that one. But still this won't be able to > reflect the AQE plan in real time, in a concurrent situation, but at least we > have initial version and final version. 
> > A simple repro: > {code:java} > d1 = spark.range(1000).withColumn("key", expr("id % > 100")).groupBy("key").agg({"key": "count"}) > cached_d2 = d1.cache() > df = cached_d2.filter("key > 10") > df.collect() {code} > {code:java} > >>> df.explain() > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=true > +- == Final Plan == > *(1) Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- TableCacheQueryStage 0 > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), > (key#4L > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], > functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10) > +- == Initial Plan == > Filter (isnotnull(key#4L) AND (key#4L > 10)) > +- InMemoryTableScan [key#4L, count(key)#10L], [isnotnull(key#4L), (key#4L > > 10)] > +- InMemoryRelation [key#4L, count(key)#10L], StorageLevel(disk, > memory, deserialized, 1 replicas) > +- AdaptiveSparkPlan isFinalPlan=false > +- HashAggregate(keys=[key#4L], functions=[count(key#4L)]) > +- Exchange hashpartitioning(key#4L, 200), > ENSURE_REQUIREMENTS, [plan_id=24] > +- HashAggregate(keys=[key#4L], > functions=[partial_count(key#4L)]) > +- Project [(id#2L % 100) AS key#4L] > +- Range (0, 1000, step=1, splits=10){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
[ https://issues.apache.org/jira/browse/SPARK-47277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47277: --- Labels: pull-request-available (was: ) > PySpark util function assertDataFrameEqual should not support streaming DF > -- > > Key: SPARK-47277 > URL: https://issues.apache.org/jira/browse/SPARK-47277 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark, SQL, Structured Streaming >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Wei Liu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47277) PySpark util function assertDataFrameEqual should not support streaming DF
Wei Liu created SPARK-47277: --- Summary: PySpark util function assertDataFrameEqual should not support streaming DF Key: SPARK-47277 URL: https://issues.apache.org/jira/browse/SPARK-47277 Project: Spark Issue Type: New Feature Components: Connect, PySpark, SQL, Structured Streaming Affects Versions: 3.5.1, 3.5.0, 4.0.0 Reporter: Wei Liu
[jira] [Updated] (SPARK-47272) MapState Implementation for State V2
[ https://issues.apache.org/jira/browse/SPARK-47272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47272: --- Labels: pull-request-available (was: ) > MapState Implementation for State V2 > > > Key: SPARK-47272 > URL: https://issues.apache.org/jira/browse/SPARK-47272 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jing Zhan >Priority: Major > Labels: pull-request-available > > This task adds changes for MapState implementation in State Api v2. This > implementation adds a new encoder/decoder to encode grouping key and user key > into a composite key to be put into RocksDB so that we could retrieve > key-value pair by user specified user key by one rocksdb get. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
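The composite-key idea described in SPARK-47272 can be sketched as follows. This is an illustration only: the byte layout (a 4-byte big-endian length prefix for the grouping key, followed by the grouping key and the user key) is an assumption and not necessarily the encoding Spark uses, the function names are hypothetical, and a plain `dict` stands in for RocksDB.

```python
import struct


def encode_composite_key(grouping_key: bytes, user_key: bytes) -> bytes:
    """Join the two keys; the length prefix makes the split unambiguous."""
    return struct.pack(">I", len(grouping_key)) + grouping_key + user_key


def decode_composite_key(composite: bytes):
    """Split a composite key back into (grouping_key, user_key)."""
    (grouping_len,) = struct.unpack(">I", composite[:4])
    grouping_key = composite[4:4 + grouping_len]
    user_key = composite[4 + grouping_len:]
    return grouping_key, user_key


# A dict stands in for the key-value store in this sketch.
store = {}
store[encode_composite_key(b"user-1", b"clicks")] = b"42"
store[encode_composite_key(b"user-1", b"views")] = b"7"
store[encode_composite_key(b"user-2", b"clicks")] = b"3"

# One point lookup retrieves the value for (grouping key, user key).
value = store[encode_composite_key(b"user-1", b"views")]
print(value)
```

Because the grouping key comes first, all entries of one grouping key share a common prefix, so a store that supports prefix scans could also iterate over one map without touching the others.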
[jira] [Created] (SPARK-47276) Introduce `spark.profile.clear` for SparkSession-based profiling
Xinrong Meng created SPARK-47276: Summary: Introduce `spark.profile.clear` for SparkSession-based profiling Key: SPARK-47276 URL: https://issues.apache.org/jira/browse/SPARK-47276 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Xinrong Meng Introduce `spark.profile.clear` for SparkSession-based profiling
[jira] [Created] (SPARK-47275) XML: Change to not support DROPMALFORMED parse mode
Yousof Hosny created SPARK-47275: Summary: XML: Change to not support DROPMALFORMED parse mode Key: SPARK-47275 URL: https://issues.apache.org/jira/browse/SPARK-47275 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Yousof Hosny Change XML expressions to not support DROPMALFORMED parse mode. This matches JSON expressions which also do not support it.
[jira] [Resolved] (SPARK-47252) Clarify that pivot may trigger an eager computation
[ https://issues.apache.org/jira/browse/SPARK-47252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47252. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45363 [https://github.com/apache/spark/pull/45363] > Clarify that pivot may trigger an eager computation > --- > > Key: SPARK-47252 > URL: https://issues.apache.org/jira/browse/SPARK-47252 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47252) Clarify that pivot may trigger an eager computation
[ https://issues.apache.org/jira/browse/SPARK-47252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47252: Assignee: Nicholas Chammas > Clarify that pivot may trigger an eager computation > --- > > Key: SPARK-47252 > URL: https://issues.apache.org/jira/browse/SPARK-47252 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47274) Provide more useful context for PySpark DataFrame API errors
[ https://issues.apache.org/jira/browse/SPARK-47274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47274: --- Labels: pull-request-available (was: ) > Provide more useful context for PySpark DataFrame API errors > > > Key: SPARK-47274 > URL: https://issues.apache.org/jira/browse/SPARK-47274 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Errors originating from PySpark operations can be difficult to debug with > limited context in the error messages. While improvements on the JVM side > have been made to offer detailed error contexts, PySpark errors often lack > this level of detail. Adding detailed context about the location within the > user's PySpark code where the error occurred will help debuggability for > PySpark users. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47274) Provide more useful context for PySpark DataFrame API errors
Haejoon Lee created SPARK-47274: --- Summary: Provide more useful context for PySpark DataFrame API errors Key: SPARK-47274 URL: https://issues.apache.org/jira/browse/SPARK-47274 Project: Spark Issue Type: Bug Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Haejoon Lee Errors originating from PySpark operations can be difficult to debug with limited context in the error messages. While improvements on the JVM side have been made to offer detailed error contexts, PySpark errors often lack this level of detail. Adding detailed context about the location within the user's PySpark code where the error occurred will help debuggability for PySpark users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47273) Implement python stream writer interface
[ https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaoqin Li updated SPARK-47273: --- Description: In order to support developing spark streaming sink in python, we need to implement python stream writer interface. Reuse PythonPartitionWriter to implement the serialization and execution of write callback in executor. Implement python worker process to run python streaming data sink committer and communicate with JVM through socket in spark driver. For each python streaming data sink instance there will be a long live python worker process created. Inside the python process, the python write committer will receive abort or commit function call and send back result through socket. was: In order to support developing spark streaming sink in python, we need to implement Reuse PythonPartitionWriter to implement the serialization and execution of write callback in executor. Implement python worker process to run python streaming data sink committer and communicate with JVM through socket in spark driver. For each python streaming data sink instance there will be a long live python worker process created. Inside the python process, the python write committer will receive abort or commit function call and send back result through socket. > Implement python stream writer interface > > > Key: SPARK-47273 > URL: https://issues.apache.org/jira/browse/SPARK-47273 > Project: Spark > Issue Type: Improvement > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Priority: Major > > In order to support developing spark streaming sink in python, we need to > implement python stream writer interface. > Reuse PythonPartitionWriter to implement the serialization and execution of > write callback in executor. > Implement python worker process to run python streaming data sink committer > and communicate with JVM through socket in spark driver. 
For each python > streaming data sink instance there will be a long live python worker process > created. Inside the python process, the python write committer will receive > abort or commit function call and send back result through socket. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47273) Implement python stream writer interface
[ https://issues.apache.org/jira/browse/SPARK-47273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaoqin Li updated SPARK-47273: --- Description: In order to support developing spark streaming sink in python, we need to implement Reuse PythonPartitionWriter to implement the serialization and execution of write callback in executor. Implement python worker process to run python streaming data sink committer and communicate with JVM through socket in spark driver. For each python streaming data sink instance there will be a long live python worker process created. Inside the python process, the python write committer will receive abort or commit function call and send back result through socket. was: In order to support developing spark streaming sink in python, we need to implement Reuse PythonPartitionWriter to implement the serialization and execution of write callback in executor. Implement python worker process to run python streaming data sink committer and communicate with JVM through socket in spark driver. For each python streaming data sink instance there will be a long live python worker process created. Inside the python process, the python write committer will receive abort or commit function call and send back result through socket. > Implement python stream writer interface > > > Key: SPARK-47273 > URL: https://issues.apache.org/jira/browse/SPARK-47273 > Project: Spark > Issue Type: Improvement > Components: PySpark, SS >Affects Versions: 4.0.0 >Reporter: Chaoqin Li >Priority: Major > > In order to support developing spark streaming sink in python, we need to > implement > Reuse PythonPartitionWriter to implement the serialization and execution of > write callback in executor. > Implement python worker process to run python streaming data sink committer > and communicate with JVM through socket in spark driver. For each python > streaming data sink instance there will be a long live python worker process > created. 
Inside the python process, the python write committer will receive > abort or commit function call and send back result through socket. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47273) Implement python stream writer interface
Chaoqin Li created SPARK-47273: -- Summary: Implement python stream writer interface Key: SPARK-47273 URL: https://issues.apache.org/jira/browse/SPARK-47273 Project: Spark Issue Type: Improvement Components: PySpark, SS Affects Versions: 4.0.0 Reporter: Chaoqin Li In order to support developing spark streaming sink in python, we need to implement Reuse PythonPartitionWriter to implement the serialization and execution of write callback in executor. Implement python worker process to run python streaming data sink committer and communicate with JVM through socket in spark driver. For each python streaming data sink instance there will be a long live python worker process created. Inside the python process, the python write committer will receive abort or commit function call and send back result through socket. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36691) PythonRunner failed should pass error message to ApplicationMaster too
[ https://issues.apache.org/jira/browse/SPARK-36691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-36691:
---
Labels: pull-request-available (was: )

> PythonRunner failed should pass error message to ApplicationMaster too
> --
>
> Key: SPARK-36691
> URL: https://issues.apache.org/jira/browse/SPARK-36691
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, YARN
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Priority: Major
> Labels: pull-request-available
>
> In current PySpark, stderr and stdout are printed together. If the Python script fails, it only throws a SparkUserAppsException with exit code 1, and that error is passed to the AM.
> In cluster mode, the client side only shows
> {code}
> User application exited with 1.
> {code}
> without the actual error message, so users then need to find out for themselves why their job failed.
[jira] [Commented] (SPARK-46778) get_json_object flattens wildcard queries that match a single value
[ https://issues.apache.org/jira/browse/SPARK-46778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823371#comment-17823371 ]

Pablo Langa Blanco commented on SPARK-46778:

I was looking at this and found a comment in the code that explains this behavior ([https://github.com/apache/spark/blob/35bced42474e3221cf61d13a142c3c5470df1f22/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L377]). There are tests around that code that cover it, and I reproduced it in Hive 3.1.3, which still behaves the same way, so I don't know if we can change it.

> get_json_object flattens wildcard queries that match a single value
> ---
>
> Key: SPARK-46778
> URL: https://issues.apache.org/jira/browse/SPARK-46778
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.1
> Reporter: Robert Joseph Evans
> Priority: Major
>
> I think this impacts all versions of {{get_json_object}}, but I am not 100% sure.
> The unit test for [$.store.book[*].reader|https://github.com/apache/spark/blob/39f8e1a5953b5897f893151d24dc585a80c0c8a0/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L142-L146] verifies that the output of this query is a single level JSON array, but when I put the same JSON and JSON path into [http://jsonpath.com/] I get a result with multiple levels of nesting. It looks like Apache Spark tries to flatten lists for {{[*]}} matches when there is only a single element that matches.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""get_json_object(jsonStr,"$[*].a")""").show(false)
> +--------------------------------+
> |get_json_object(jsonStr, $[*].a)|
> +--------------------------------+
> |"A"                             |
> |["A","B"]                       |
> +--------------------------------+ {code}
> But this has problems in that I no longer have a consistent schema returned, even if the input schema is known to be consistent. For example, if I wanted to know how many elements matched this query, I could wrap it in a {{json_array_length}}, but that does not work in the generic case.
> {code:java}
> scala> Seq("""[{"a":"A"},{"b":"B"}]""","""[{"a":"A"},{"a":"B"}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |null                                               |
> |2                                                  |
> +---------------------------------------------------+ {code}
> If the value returned might itself be a JSON array, then I would get a number, but it is wrong.
> {code:java}
> scala> Seq("""[{"a":[1,2,3,4,5]},{"b":"B"}]""","""[{"a":[1,2,3,4,5]},{"a":[1,2,3,4,5]}]""").toDF("jsonStr").selectExpr("""json_array_length(get_json_object(jsonStr,"$[*].a"))""").show(false)
> +---------------------------------------------------+
> |json_array_length(get_json_object(jsonStr, $[*].a))|
> +---------------------------------------------------+
> |5                                                  |
> |2                                                  |
> +---------------------------------------------------+ {code}
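The behavior reported in SPARK-46778 can be reproduced outside Spark with a small model. This is a hedged sketch, not Spark's implementation: `wildcard_field` and `json_array_length` are hypothetical pure-Python helpers that only model a `$[*].<field>` path over a top-level JSON array, and the `flatten_single` flag is an assumption added to contrast the reported behavior with an unflattened result.

```python
import json


def wildcard_field(json_str, field, flatten_single=True):
    """Toy model of `$[*].field` over a top-level JSON array."""
    elements = json.loads(json_str)
    matches = [e[field] for e in elements if isinstance(e, dict) and field in e]
    if not matches:
        return None
    if flatten_single and len(matches) == 1:
        # Reported behavior: a single match is returned without the array.
        return json.dumps(matches[0], separators=(",", ":"))
    return json.dumps(matches, separators=(",", ":"))


def json_array_length(json_str):
    """Length of the outermost JSON array, or None if it is not an array."""
    if json_str is None:
        return None
    value = json.loads(json_str)
    return len(value) if isinstance(value, list) else None


one_match = '[{"a":"A"},{"b":"B"}]'
two_matches = '[{"a":"A"},{"a":"B"}]'
one_array_match = '[{"a":[1,2,3,4,5]},{"b":"B"}]'

print(wildcard_field(one_match, "a"))
print(wildcard_field(two_matches, "a"))
print(json_array_length(wildcard_field(one_array_match, "a")))
print(json_array_length(wildcard_field(one_array_match, "a", flatten_single=False)))
```

With flattening, the count for the single-match row is either missing (the match is a string) or wrong (the match is itself an array of five elements); without flattening, the same helper reports one match in both cases.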
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823344#comment-17823344 ] Asif commented on SPARK-33152: -- [~tedjenks] The issue has always been there because of the way constraint prop rule works ( due to it permutational logic). A possible cause why it might have become more common could be due to some changes to fix the previously undetected constraints . The more robust the code becomes in detecting the constraints, chances are it would increase the cost of over all constraints code drastically. For eg if we have a projection with multiple aliases, and say these aliases are used in case when expressions and involve functions taking these aliases, the number of constraints created would be enormous and even then the code ( atleast in 3.2) would not be able to cover all the possible constraints.. so my guess is that any changes to increase the sensitivity of constraints identification will affect the cost of the evaluation of constraints.. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. 
> # This issue, if not fixed, can cause OutOfMemory errors or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning, as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where the current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account whether the same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove them on the basis of constraint data. In some > cases the rule works only by virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule, and its removal of > redundant filters and addition of new inferred filters, depends > on how other, unrelated earlier optimizer rules > behave is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in the existing ConstraintPropagationSuite which is > missing an IsNotNull constraint because the code incorrectly generated an > EqualsNullSafeConstraint instead of an EqualTo constraint, when using the > existing Constraints code. 
With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint. > # It does away with the current combinatorial logic of evaluating all the > constraints, which can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered. > h2. Q2. What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring an IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed or does not > happen, respectively, with the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > The current ConstraintsPropagation code pessimistically tries to generate all > the possible combinations of constraints, based on
[jira] [Updated] (SPARK-44746) Improve the documentation for TABLE input arguments for UDTFs
[ https://issues.apache.org/jira/browse/SPARK-44746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44746: --- Labels: pull-request-available (was: ) > Improve the documentation for TABLE input arguments for UDTFs > - > > Key: SPARK-44746 > URL: https://issues.apache.org/jira/browse/SPARK-44746 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > We should add more examples for using Python UDTFs with TABLE arguments. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47032) Create API for 'analyze' method to send input column(s) to output table unchanged
[ https://issues.apache.org/jira/browse/SPARK-47032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel resolved SPARK-47032. Resolution: Won't Fix After investigating this, we found that any performance benefits from skipping sending column values from the JVM to the Python worker were obviated by extra manipulation of values on the JVM side to forward them to the output table. As a result, we are closing this. > Create API for 'analyze' method to send input column(s) to output table > unchanged > - > > Key: SPARK-47032 > URL: https://issues.apache.org/jira/browse/SPARK-47032 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Daniel >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47272) MapState Implementation for State V2
[ https://issues.apache.org/jira/browse/SPARK-47272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhan updated SPARK-47272: -- Epic Link: SPARK-46815 Description: This task adds the MapState implementation in State API v2. The implementation adds a new encoder/decoder that encodes the grouping key and the user key into a composite key stored in RocksDB, so that a key-value pair can be retrieved by its user-specified user key with a single RocksDB get. Issue Type: Task (was: New Feature) > MapState Implementation for State V2 > > > Key: SPARK-47272 > URL: https://issues.apache.org/jira/browse/SPARK-47272 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Jing Zhan >Priority: Major > > This task adds the MapState implementation in State API v2. The > implementation adds a new encoder/decoder that encodes the grouping key and > the user key into a composite key stored in RocksDB, so that a key-value pair > can be retrieved by its user-specified user key with a single RocksDB get.
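The composite-key idea described in SPARK-47272 can be illustrated with a minimal Python sketch. This is not Spark's encoder: the length-prefixed byte layout, the helper names `encode_composite_key` and `decode_composite_key`, and the plain dict standing in for RocksDB are all assumptions made for illustration only.

```python
import struct
from typing import Tuple


def encode_composite_key(grouping_key: bytes, user_key: bytes) -> bytes:
    # Hypothetical layout: 4-byte big-endian length of the grouping key,
    # followed by the grouping key bytes, followed by the user key bytes.
    return struct.pack(">I", len(grouping_key)) + grouping_key + user_key


def decode_composite_key(composite: bytes) -> Tuple[bytes, bytes]:
    # Read the length prefix, then split the remainder at that offset.
    (grouping_len,) = struct.unpack(">I", composite[:4])
    return composite[4:4 + grouping_len], composite[4 + grouping_len:]


# A dict stands in for the key-value store (an assumption, not RocksDB itself).
store = {}
store[encode_composite_key(b"user-1", b"clicks")] = b"42"
store[encode_composite_key(b"user-1", b"views")] = b"7"

# One lookup by (grouping key, user key) retrieves the value directly.
value = store[encode_composite_key(b"user-1", b"clicks")]
```

The length prefix keeps the encoding unambiguous: without it, the pairs (`ab`, `c`) and (`a`, `bc`) would map to the same bytes.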
[jira] [Updated] (SPARK-47242) Bump ap-loader 3.0(v8) to support for async-profiler 3.0
[ https://issues.apache.org/jira/browse/SPARK-47242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47242: -- Parent: SPARK-47046 Issue Type: Sub-task (was: Improvement) > Bump ap-loader 3.0(v8) to support for async-profiler 3.0 > > > Key: SPARK-47242 > URL: https://issues.apache.org/jira/browse/SPARK-47242 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 4.0.0 >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > ap-loader 3.0 (v8) has already been released and supports async-profiler > 3.0. For release details, see [Loader for 3.0 (v8): Binary launcher and > AsyncGetCallTrace > replacement|https://github.com/jvm-profiling-tools/ap-loader/releases/tag/3.0-8].
[jira] [Resolved] (SPARK-47242) Bump ap-loader 3.0(v8) to support for async-profiler 3.0
[ https://issues.apache.org/jira/browse/SPARK-47242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47242. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45351 [https://github.com/apache/spark/pull/45351] > Bump ap-loader 3.0(v8) to support for async-profiler 3.0 > > > Key: SPARK-47242 > URL: https://issues.apache.org/jira/browse/SPARK-47242 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > ap-loader 3.0 (v8) has already been released and supports async-profiler > 3.0. For release details, see [Loader for 3.0 (v8): Binary launcher and > AsyncGetCallTrace > replacement|https://github.com/jvm-profiling-tools/ap-loader/releases/tag/3.0-8].
[jira] [Assigned] (SPARK-47242) Bump ap-loader 3.0(v8) to support for async-profiler 3.0
[ https://issues.apache.org/jira/browse/SPARK-47242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47242: - Assignee: Nicholas Jiang > Bump ap-loader 3.0(v8) to support for async-profiler 3.0 > > > Key: SPARK-47242 > URL: https://issues.apache.org/jira/browse/SPARK-47242 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Major > Labels: pull-request-available > > ap-loader 3.0 (v8) has already been released and supports async-profiler > 3.0. For release details, see [Loader for 3.0 (v8): Binary launcher and > AsyncGetCallTrace > replacement|https://github.com/jvm-profiling-tools/ap-loader/releases/tag/3.0-8].
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823314#comment-17823314 ] Ted Chester Jenks commented on SPARK-33152: --- [~cloud_fan] [~ashahid7] We have seen a huge uptick in these issues since upgrading from 3.2 to 3.4. It is not exactly clear why these became so prevalent but it is a big issue. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing a new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases (with certain use cases the compilation time can go into > hours), has the potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, and does not push compound predicates in Join. > # This issue, if not fixed, can cause OutOfMemory errors or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning, as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where the current code is not able to identify the redundant filter in some cases. 
> # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account whether the same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove them on the basis of constraint data. In some > cases the rule works only by virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule, and its removal of > redundant filters and addition of new inferred filters, depends > on how other, unrelated earlier optimizer rules > behave is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in the existing ConstraintPropagationSuite which is > missing an IsNotNull constraint because the code incorrectly generated an > EqualsNullSafeConstraint instead of an EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint. > # It does away with the current combinatorial logic of evaluating all the > constraints, which can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered. > h2. Q2. What problem is this proposal NOT designed to solve? > It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring an IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed or does not > happen, respectively, with the present code. > h2. Q3. How is it done today, and what are the limits of current practice? 
> The current ConstraintsPropagation code pessimistically tries to generate all > the possible combinations of constraints, based on the aliases (even then > it may miss a lot of combinations if the expression is a complex expression > involving the same attribute repeated multiple times within the expression and > there are many aliases to that column). There are query plans in our > production env which can result in the intermediate number of constraints going > into hundreds of thousands, causing OOM or taking time running into hours. > Also, there are cases where it incorrectly generates an EqualNullSafe > constraint instead of an EqualTo constraint, thus missing a possible IsNull > constraint on the column. > Also it only pushes single column predicate on the other side of the
[jira] [Updated] (SPARK-47269) Upgrade jetty to 11.0.20
[ https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47269: -- Parent: SPARK-47046 Issue Type: Sub-task (was: Improvement) > Upgrade jetty to 11.0.20 > > > Key: SPARK-47269 > URL: https://issues.apache.org/jira/browse/SPARK-47269 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > fix > * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - > HTTP/2 connection not closed after idle timeout when TCP congested -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47269) Upgrade jetty to 11.0.20
[ https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47269: - Assignee: Yang Jie > Upgrade jetty to 11.0.20 > > > Key: SPARK-47269 > URL: https://issues.apache.org/jira/browse/SPARK-47269 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > > fix > * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - > HTTP/2 connection not closed after idle timeout when TCP congested -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47269) Upgrade jetty to 11.0.20
[ https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47269. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45371 [https://github.com/apache/spark/pull/45371] > Upgrade jetty to 11.0.20 > > > Key: SPARK-47269 > URL: https://issues.apache.org/jira/browse/SPARK-47269 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > fix > * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - > HTTP/2 connection not closed after idle timeout when TCP congested -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46961) Adding processorHandle as a Context Variable
[ https://issues.apache.org/jira/browse/SPARK-46961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-46961: Assignee: Eric Marnadi > Adding processorHandle as a Context Variable > > > Key: SPARK-46961 > URL: https://issues.apache.org/jira/browse/SPARK-46961 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Eric Marnadi >Assignee: Eric Marnadi >Priority: Major > Labels: pull-request-available > > Instead of passing the StatefulProcessorHandle to the user in `init`, embed > it as a context variable, ProcessorContext, that the user can fetch.
[jira] [Resolved] (SPARK-46961) Adding processorHandle as a Context Variable
[ https://issues.apache.org/jira/browse/SPARK-46961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-46961. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45359 [https://github.com/apache/spark/pull/45359] > Adding processorHandle as a Context Variable > > > Key: SPARK-46961 > URL: https://issues.apache.org/jira/browse/SPARK-46961 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Eric Marnadi >Assignee: Eric Marnadi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Instead of passing the StatefulProcessorHandle to the user in `init`, embed > it as a context variable, ProcessorContext, that the user can fetch.
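The general pattern in SPARK-46961 (fetching a handle from context rather than receiving it as an `init` argument) can be sketched in Python with the standard `contextvars` module. This is only an analogy, not Spark's API: the `Handle` and `Processor` classes and the `get_handle` and `run_with_handle` helpers are hypothetical names invented for this example.

```python
import contextvars


class Handle:
    # Hypothetical stand-in for a handle object; not Spark's StatefulProcessorHandle.
    def __init__(self, name):
        self.name = name


# The context variable that holds the handle for the currently running processor.
_current_handle = contextvars.ContextVar("current_handle")


def get_handle():
    # Raises LookupError when no handle has been set in the current context.
    return _current_handle.get()


class Processor:
    def init(self):
        # No handle parameter: the processor fetches it from the context.
        self.handle = get_handle()


def run_with_handle(handle, processor):
    # The framework side sets the handle, calls init, then restores the old state.
    token = _current_handle.set(handle)
    try:
        processor.init()
    finally:
        _current_handle.reset(token)


processor = Processor()
run_with_handle(Handle("h1"), processor)
```

Outside `run_with_handle`, calling `get_handle()` raises `LookupError`, which makes accidental use outside the framework-managed scope visible early.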
[jira] [Updated] (SPARK-47271) Explain importance of statistics on SQL performance tuning page
[ https://issues.apache.org/jira/browse/SPARK-47271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47271: --- Labels: pull-request-available (was: ) > Explain importance of statistics on SQL performance tuning page > --- > > Key: SPARK-47271 > URL: https://issues.apache.org/jira/browse/SPARK-47271 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available
[jira] [Assigned] (SPARK-47268) Repartition by column should respect collation
[ https://issues.apache.org/jira/browse/SPARK-47268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47268: --- Assignee: Aleksandar Tomic > Repartition by column should respect collation > -- > > Key: SPARK-47268 > URL: https://issues.apache.org/jira/browse/SPARK-47268 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available
[jira] [Resolved] (SPARK-47268) Repartition by column should respect collation
[ https://issues.apache.org/jira/browse/SPARK-47268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47268. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45370 [https://github.com/apache/spark/pull/45370] > Repartition by column should respect collation > -- > > Key: SPARK-47268 > URL: https://issues.apache.org/jira/browse/SPARK-47268 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Created] (SPARK-47271) Explain importance of statistics on SQL performance tuning page
Nicholas Chammas created SPARK-47271: Summary: Explain importance of statistics on SQL performance tuning page Key: SPARK-47271 URL: https://issues.apache.org/jira/browse/SPARK-47271 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 4.0.0 Reporter: Nicholas Chammas
[jira] [Resolved] (SPARK-47070) Subquery rewrite inside an aggregation makes an aggregation invalid
[ https://issues.apache.org/jira/browse/SPARK-47070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47070. - Fix Version/s: 4.0.0 Resolution: Fixed > Subquery rewrite inside an aggregation makes an aggregation invalid > --- > > Key: SPARK-47070 > URL: https://issues.apache.org/jira/browse/SPARK-47070 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Anton Lykov >Assignee: Anton Kirillov >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > When an in/exists-subquery appears inside an aggregate expression within a > top-level GROUP BY, it gets rewritten and a new `exists` variable is > introduced. However, this variable is incorrectly handled in aggregation. For > example, consider the following query: > ``` > SELECT > CASE > WHEN t1.id IN (SELECT id FROM t2) THEN 10 > ELSE -10 > END AS v1 > FROM t1 > GROUP BY t1.id; > ``` > > Executing it leads to the following error: > ``` > java.lang.IllegalArgumentException: Cannot find column index for attribute > 'exists#844' in: Map() > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47131) contains, startswith, endswith
[ https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47131. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45216 [https://github.com/apache/spark/pull/45216] > contains, startswith, endswith > -- > > Key: SPARK-47131 > URL: https://issues.apache.org/jira/browse/SPARK-47131 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Refactored built-in string functions to enable collation support for: > {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now > be able to use COLLATE within arguments for built-in string functions: > CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS > implementation for non-binary collations is a separate subtask (SPARK-47248).
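To make concrete what collation-aware matching changes for these three functions, here is a small Python sketch. It is not Spark's implementation and does not use Spark's collation names: `str.casefold()` serves as an assumed stand-in for a case-insensitive (non-binary) collation, and the function names and the `key` parameter are hypothetical.

```python
def collation_key(s: str) -> str:
    # Assumed stand-in for a case-insensitive collation; real collations
    # (for example ICU-based ones) apply much richer comparison rules.
    return s.casefold()


def contains(text: str, pattern: str, key=None) -> bool:
    # With key=None this is a plain binary (exact) comparison.
    if key is not None:
        text, pattern = key(text), key(pattern)
    return pattern in text


def startswith(text: str, pattern: str, key=None) -> bool:
    if key is not None:
        text, pattern = key(text), key(pattern)
    return text.startswith(pattern)


def endswith(text: str, pattern: str, key=None) -> bool:
    if key is not None:
        text, pattern = key(text), key(pattern)
    return text.endswith(pattern)


binary_match = contains("Spark SQL", "sql")                  # exact comparison
collated_match = contains("Spark SQL", "sql", collation_key)  # case-insensitive
```

Mapping both operands to comparison keys up front is the simplest approach but allocates new strings for every call; the in-place comparison mentioned for SPARK-47248 avoids that cost, and this sketch does not attempt to reproduce it.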
[jira] [Assigned] (SPARK-47131) contains, startswith, endswith
[ https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47131: --- Assignee: Uroš Bojanić > contains, startswith, endswith > -- > > Key: SPARK-47131 > URL: https://issues.apache.org/jira/browse/SPARK-47131 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Refactored built-in string functions to enable collation support for: > {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now > be able to use COLLATE within arguments for built-in string functions: > CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS > implementation for non-binary collations is a separate subtask (SPARK-47248).
[jira] [Updated] (SPARK-47270) Dataset.isEmpty should not trigger job execution on CommandResults
[ https://issues.apache.org/jira/browse/SPARK-47270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47270: --- Labels: pull-request-available (was: ) > Dataset.isEmpty should not trigger job execution on CommandResults > -- > > Key: SPARK-47270 > URL: https://issues.apache.org/jira/browse/SPARK-47270 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Zhen Wang >Priority: Major > Labels: pull-request-available > > Similar to [SPARK-43124|https://issues.apache.org/jira/browse/SPARK-43124], > Dataset.isEmpty should also not trigger job execution on CommandResults. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47270) Dataset.isEmpty should not trigger job execution on CommandResults
Zhen Wang created SPARK-47270: - Summary: Dataset.isEmpty should not trigger job execution on CommandResults Key: SPARK-47270 URL: https://issues.apache.org/jira/browse/SPARK-47270 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Zhen Wang Similar to [SPARK-43124|https://issues.apache.org/jira/browse/SPARK-43124], Dataset.isEmpty should also not trigger job execution on CommandResults. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47265) Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., columns: Array[Column], ...)` in UT
[ https://issues.apache.org/jira/browse/SPARK-47265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-47265: Summary: Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., columns: Array[Column], ...)` in UT (was: Enable test of v2 data sources in `FileBasedDataSourceSuite`) > Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., > columns: Array[Column], ...)` in UT > > > Key: SPARK-47265 > URL: https://issues.apache.org/jira/browse/SPARK-47265 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47269) Upgrade jetty to 11.0.20
[ https://issues.apache.org/jira/browse/SPARK-47269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47269: --- Labels: pull-request-available (was: ) > Upgrade jetty to 11.0.20 > > > Key: SPARK-47269 > URL: https://issues.apache.org/jira/browse/SPARK-47269 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > fix > * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - > HTTP/2 connection not closed after idle timeout when TCP congested -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47269) Upgrade jetty to 11.0.20
Yang Jie created SPARK-47269: Summary: Upgrade jetty to 11.0.20 Key: SPARK-47269 URL: https://issues.apache.org/jira/browse/SPARK-47269 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie fix * [CVE-2024-22201|https://github.com/advisories/GHSA-rggv-cv7r-mw98] - HTTP/2 connection not closed after idle timeout when TCP congested -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47268) Repartition by column should respect collation
[ https://issues.apache.org/jira/browse/SPARK-47268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47268: --- Labels: pull-request-available (was: ) > Repartition by column should respect collation > -- > > Key: SPARK-47268 > URL: https://issues.apache.org/jira/browse/SPARK-47268 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > Labels: pull-request-available
[jira] [Resolved] (SPARK-43258) Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]
[ https://issues.apache.org/jira/browse/SPARK-43258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-43258. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45288 [https://github.com/apache/spark/pull/45288] > Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5] > -- > > Key: SPARK-43258 > URL: https://issues.apache.org/jira/browse/SPARK-43258 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Assignee: Deng Ziming >Priority: Minor > Labels: pull-request-available, starter > Fix For: 4.0.0 > > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2023* defined in > {*}core/src/main/resources/error/error-classes.json{*}. The name should be > short but complete (look at the example in error-classes.json). > Add a test that triggers the error from user code if such a test doesn't > exist yet. Check exception fields by using {*}checkError(){*}. This function > checks only the relevant error fields and avoids depending on the error > message text. This way, tech editors can modify the error format in > error-classes.json without worrying about Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace > the error with an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not > clear. Suggest to users how to avoid and fix this kind of error. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490]
[jira] [Assigned] (SPARK-43258) Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5]
[ https://issues.apache.org/jira/browse/SPARK-43258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-43258: Assignee: Deng Ziming > Assign a name to the error class _LEGACY_ERROR_TEMP_202[3,4,5] > -- > > Key: SPARK-43258 > URL: https://issues.apache.org/jira/browse/SPARK-43258 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Assignee: Deng Ziming >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2023* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json). > Add a test that triggers the error from user code if such a test doesn't exist yet. Check exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error message text. This way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not clear. Propose to users how to avoid and fix this kind of error. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
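The reasoning behind checking error fields instead of message text can be illustrated with a small sketch. This is not Spark's checkError(); the names `StructuredError`, `check_error`, `failing_operation`, and the error class string `HYPOTHETICAL_ERROR_NAME` are all assumptions made up for illustration.

```python
class StructuredError(Exception):
    """Hypothetical exception that carries an error class and parameters."""

    def __init__(self, error_class, parameters, message):
        super().__init__(message)
        self.error_class = error_class
        self.parameters = parameters


def check_error(func, error_class, parameters):
    """Run func and compare only the structured fields of the raised error.

    The human-readable message is deliberately ignored, so rewording it
    does not break the test.
    """
    try:
        func()
    except StructuredError as e:
        assert e.error_class == error_class, e.error_class
        assert e.parameters == parameters, e.parameters
        return e
    raise AssertionError("expected a StructuredError, but nothing was raised")


def failing_operation(message_text):
    # Hypothetical user-facing operation; only the message wording varies.
    raise StructuredError("HYPOTHETICAL_ERROR_NAME", {"value": "42"}, message_text)


# Both wordings pass, because the check never looks at the message text.
check_error(lambda: failing_operation("Bad value: 42"),
            "HYPOTHETICAL_ERROR_NAME", {"value": "42"})
check_error(lambda: failing_operation("The value 42 is not allowed."),
            "HYPOTHETICAL_ERROR_NAME", {"value": "42"})
```

A test written this way keeps passing when the message template is edited, which is the property the issue description asks for.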
[jira] [Created] (SPARK-47268) Repartition by column should respect collation
Aleksandar Tomic created SPARK-47268: Summary: Repartition by column should respect collation Key: SPARK-47268 URL: https://issues.apache.org/jira/browse/SPARK-47268 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 4.0.0 Reporter: Aleksandar Tomic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47267) Hash functions should respect collation
Aleksandar Tomic created SPARK-47267: Summary: Hash functions should respect collation Key: SPARK-47267 URL: https://issues.apache.org/jira/browse/SPARK-47267 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 4.0.0 Reporter: Aleksandar Tomic All functions in the `hash_funcs` group should respect collation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
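What "respect collation" means for a hash function can be sketched as follows: strings that compare equal under the collation must hash to the same value. This is a minimal illustration, not Spark's implementation. It assumes that lowercasing is an acceptable stand-in for a case-insensitive collation key, and `collation_key` and `collation_aware_hash` are hypothetical names.

```python
import hashlib


def collation_key(value: str, case_insensitive: bool) -> bytes:
    # Assumption: lowercasing approximates a case-insensitive collation key.
    # A real collation would derive its key from the collation's own rules.
    normalized = value.lower() if case_insensitive else value
    return normalized.encode("utf-8")


def collation_aware_hash(value: str, case_insensitive: bool) -> str:
    # Hash the collation key instead of the raw bytes, so values that are
    # equal under the collation produce equal hashes.
    return hashlib.sha256(collation_key(value, case_insensitive)).hexdigest()


# Under the case-insensitive stand-in, "Spark" and "SPARK" hash identically.
same = collation_aware_hash("Spark", True) == collation_aware_hash("SPARK", True)
# Under binary comparison they are different strings and hash differently.
different = collation_aware_hash("Spark", False) != collation_aware_hash("SPARK", False)
```

The same idea applies wherever hashes decide grouping or placement: if the hash ignores the collation, values that the collation treats as equal can end up in different buckets.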
[jira] [Assigned] (SPARK-47131) contains, startswith, endswith
[ https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47131: -- Assignee: (was: Apache Spark) > contains, startswith, endswith > -- > > Key: SPARK-47131 > URL: https://issues.apache.org/jira/browse/SPARK-47131 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Refactored built-in string functions to enable collation support for: > {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now > be able to use COLLATE within arguments for built-in string functions: > CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS > implementation for non-binary collations is a separate subtask (SPARK-47248). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
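The intended behaviour of the three functions under a non-binary collation can be sketched like this. It is only an illustration under a strong assumption: lowercasing both arguments stands in for a case-insensitive collation, which is not how a real collation-aware comparison works in general. The function names and the `case_insensitive` flag are hypothetical.

```python
def _key(value: str, case_insensitive: bool) -> str:
    # Assumption: lowercasing is a simplified stand-in for collation rules.
    return value.lower() if case_insensitive else value


def contains(text: str, pattern: str, case_insensitive: bool = False) -> bool:
    return _key(pattern, case_insensitive) in _key(text, case_insensitive)


def starts_with(text: str, prefix: str, case_insensitive: bool = False) -> bool:
    return _key(text, case_insensitive).startswith(_key(prefix, case_insensitive))


def ends_with(text: str, suffix: str, case_insensitive: bool = False) -> bool:
    return _key(text, case_insensitive).endswith(_key(suffix, case_insensitive))


# Binary comparison: case matters.
binary_result = contains("Apache Spark", "SPARK")                            # False
# Case-insensitive stand-in: the same call now matches.
collated_result = contains("Apache Spark", "SPARK", case_insensitive=True)   # True
```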
[jira] [Assigned] (SPARK-47131) contains, startswith, endswith
[ https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47131: -- Assignee: Apache Spark > contains, startswith, endswith > -- > > Key: SPARK-47131 > URL: https://issues.apache.org/jira/browse/SPARK-47131 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Refactored built-in string functions to enable collation support for: > {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now > be able to use COLLATE within arguments for built-in string functions: > CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS > implementation for non-binary collations is a separate subtask (SPARK-47248). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47131) contains, startswith, endswith
[ https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47131: -- Assignee: (was: Apache Spark) > contains, startswith, endswith > -- > > Key: SPARK-47131 > URL: https://issues.apache.org/jira/browse/SPARK-47131 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Refactored built-in string functions to enable collation support for: > {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now > be able to use COLLATE within arguments for built-in string functions: > CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS > implementation for non-binary collations is a separate subtask (SPARK-47248). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47131) contains, startswith, endswith
[ https://issues.apache.org/jira/browse/SPARK-47131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47131: -- Assignee: Apache Spark > contains, startswith, endswith > -- > > Key: SPARK-47131 > URL: https://issues.apache.org/jira/browse/SPARK-47131 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Refactored built-in string functions to enable collation support for: > {_}contains{_}, {_}startsWith{_}, {_}endsWith{_}. Spark SQL users should now > be able to use COLLATE within arguments for built-in string functions: > CONTAINS, STARTSWITH, ENDSWITH in Spark SQL queries. Note: CONTAINS > implementation for non-binary collations is a separate subtask (SPARK-47248). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47070) Subquery rewrite inside an aggregation makes an aggregation invalid
[ https://issues.apache.org/jira/browse/SPARK-47070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47070: -- Assignee: (was: Apache Spark) > Subquery rewrite inside an aggregation makes an aggregation invalid > --- > > Key: SPARK-47070 > URL: https://issues.apache.org/jira/browse/SPARK-47070 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Anton Lykov >Priority: Major > Labels: pull-request-available
>
> When an in/exists-subquery appears inside an aggregate expression within a top-level GROUP BY, it gets rewritten and a new `exists` variable is introduced. However, this variable is incorrectly handled in aggregation. For example, consider the following query:
> ```
> SELECT
>   CASE
>     WHEN t1.id IN (SELECT id FROM t2) THEN 10
>     ELSE -10
>   END AS v1
> FROM t1
> GROUP BY t1.id;
> ```
>
> Executing it leads to the following error:
> ```
> java.lang.IllegalArgumentException: Cannot find column index for attribute 'exists#844' in: Map()
> ```
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47070) Subquery rewrite inside an aggregation makes an aggregation invalid
[ https://issues.apache.org/jira/browse/SPARK-47070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47070: -- Assignee: Apache Spark > Subquery rewrite inside an aggregation makes an aggregation invalid > --- > > Key: SPARK-47070 > URL: https://issues.apache.org/jira/browse/SPARK-47070 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Anton Lykov >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available
>
> When an in/exists-subquery appears inside an aggregate expression within a top-level GROUP BY, it gets rewritten and a new `exists` variable is introduced. However, this variable is incorrectly handled in aggregation. For example, consider the following query:
> ```
> SELECT
>   CASE
>     WHEN t1.id IN (SELECT id FROM t2) THEN 10
>     ELSE -10
>   END AS v1
> FROM t1
> GROUP BY t1.id;
> ```
>
> Executing it leads to the following error:
> ```
> java.lang.IllegalArgumentException: Cannot find column index for attribute 'exists#844' in: Map()
> ```
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47266) Make `ProtoUtils.abbreviate` return the same type as the input
[ https://issues.apache.org/jira/browse/SPARK-47266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47266. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45369 [https://github.com/apache/spark/pull/45369] > Make `ProtoUtils.abbreviate` return the same type as the input > -- > > Key: SPARK-47266 > URL: https://issues.apache.org/jira/browse/SPARK-47266 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47253) Allow LiveEventBus to stop without the completely draining of event queue
[ https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] TakawaAkirayo updated SPARK-47253: -- Affects Version/s: 4.0.0 (was: 3.5.0) > Allow LiveEventBus to stop without the completely draining of event queue > - > > Key: SPARK-47253 > URL: https://issues.apache.org/jira/browse/SPARK-47253 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: TakawaAkirayo >Priority: Minor > Labels: pull-request-available
>
> #Problem statement:
> SparkContext.stop() hangs for a long time in LiveEventBus.stop() when listeners are slow.
>
> #User scenarios:
> We have a centralized service with multiple instances that regularly executes users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process with an account defined in the task.
> 2. Instantiate listeners by class name and register them with the SparkContext. The JARs containing the listener classes are uploaded to the service by the user.
> 3. Prepare resources.
> 4. Run user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
>
> Step 5 waits for the LiveEventBus to stop, which requires each listener to completely drain the remaining events.
> Because the listeners are implemented by users, we cannot prevent heavy work inside a listener on each event. There are cases where a single heavy job has over 30,000 tasks and the listener takes 30 minutes to process all the remaining events, because the listener needs a coarse-grained global lock and writes its internal status to a remote database.
> This kind of delay affects other user tasks in the queue. Therefore, from the server-side perspective, we need a guarantee that the stop operation finishes quickly.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
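The behaviour requested in SPARK-47253 can be sketched as an event bus whose stop waits for the queue to drain only up to a deadline and then discards whatever is left. This is a simplified illustration, not Spark's code: the class name `BoundedStopEventBus`, its methods, and the `drain_timeout_s` parameter are all hypothetical.

```python
import queue
import threading
import time


class BoundedStopEventBus:
    """Hypothetical single-listener event bus with a time-bounded stop."""

    def __init__(self, listener):
        self._queue = queue.Queue()
        self._listener = listener
        self._stopped = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def post(self, event):
        # Events posted after stop has been requested are ignored.
        if not self._stopped.is_set():
            self._queue.put(event)

    def _run(self):
        # Deliver events one by one until stop is requested.
        while not self._stopped.is_set():
            try:
                event = self._queue.get(timeout=0.01)
            except queue.Empty:
                continue
            self._listener(event)

    def stop(self, drain_timeout_s):
        """Wait up to drain_timeout_s for the queue to empty, then give up.

        Returns the number of events that stop removed from the queue
        without delivering them.
        """
        deadline = time.monotonic() + drain_timeout_s
        while not self._queue.empty() and time.monotonic() < deadline:
            time.sleep(0.005)
        self._stopped.set()
        dropped = 0
        while True:
            try:
                self._queue.get_nowait()
                dropped += 1
            except queue.Empty:
                break
        return dropped
```

With a listener that takes 50 ms per event and 50 queued events, `stop(drain_timeout_s=0.1)` returns after roughly the timeout and reports the undelivered events, rather than blocking for the full drain. The trade-off is explicit: the caller gets a bounded stop time at the cost of losing events that the listener never sees.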
[jira] [Created] (SPARK-47266) Make `ProtoUtils.abbreviate` return the same type as the input
Ruifeng Zheng created SPARK-47266: - Summary: Make `ProtoUtils.abbreviate` return the same type as the input Key: SPARK-47266 URL: https://issues.apache.org/jira/browse/SPARK-47266 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org