[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1259029762 Can you take a look at comment above @yabola and work on the fix ? Since you already spent a lot of time on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1259007734 Thanks for the query @Ngone51 - I missed out one aspect of my analysis, which ends up completely changing the solution - my bad :-( The answer to your query has the reason for the lack of failure - this is due to the two types of api's we are using ... For `DeducatedMessageLoop`, for the initial submission of `receiveLoopRunnable` - we use `ExecutorService.submit` api - while for other cases, we use the `Executor.execute` api - and this is the cause for the behavior. The `submit` api returns a `Future` - and if we look at the implementation of `FutureTask.run`, we see that it catches `Throwable` and preserves that as the outcome (`setException`) - which is the reason why the thread itself does not die. So the specific case might be mitigated by using `execute` instead of `submit` api. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1258716473 Added a few debug statements, and it became clear what the issue is. Essentially, since we are leveraging a `ThreadPoolExecutor`, it does not result in killing the thread with the exception/error thrown - but rather, will call `ThreadPoolExecutor.afterExecute` with the cause for failure (See `runWorker` for more). We should be overriding this, and invoke our `uncaughtExceptionHandler` when an exception is thrown. In `receiveLoopRunnable` when a `Throwable` is thrown: ``` 22/09/26 17:17:12 INFO DedicatedMessageLoop: Current exceptionHandler = org.apache.spark.util.SparkUncaughtExceptionHandler@27c71f14 22/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main] 22/09/26 17:17:12 INFO DedicatedMessageLoop: Stack ... java.lang.Exception: For stack at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:56) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) ``` In `receiveLoopRunnable`'s `run`, when a `Throwable` is thrown: ``` 2/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main], stackTrace = 22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.dumpThreads(Native Method) 22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.getAllStackTraces(Thread.java:1653) 22/09/26 17:17:12 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$dumpAllStackTraces(MessageLoop.scala:70) 22/09/26 17:17:12 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:58) 22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.FutureTask.run(FutureTask.java:264) 22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ``` and finally, a few seconds after Executor Inbox failure - dumping all threads in a new thread. ``` 22/09/26 17:17:14 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main], stackTrace = 22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/jdk.internal.misc.Unsafe.park(Native Method) 22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) 22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081) 22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433) 22/09/26 17:17:14 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:102) 22/09/26 17:17:14 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:45) 22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.run(Thread.java:829) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1258427841 I agree @Ngone51, it should have :-) I am trying to reproduce this locally and see if we are missing something - or there is some nuance behind the observed behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1255831313 I will try to reproduce this locally - is there anything else specific I need to know to do so ? Or is the instructions in SPARK-40320 sufficient ? Thx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1255648587 A few points: * Plugins are initialized as part of `Executor` construction, but after uncaught exception handler is set. * If anything other than `NonFatal` is thrown in `receive` (like the `LinkageError` from @yabola's example here), it will be handled in `Inbox.process` as part of `safelyCall` - which will result in re-throwing it. * This now bubbles up to `receiveLoop` - where we end up throwing the throwable again (after scheduling it for re-execution). * The throw should result in uncaught exception handler killing the jvm - and if it does not, then the re-enqueue in prev step will cause the message to be reprocessed. * In this case we are discussing, it should have resulted in killing the VM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1254457075 I am not sure I follow what the code snippet is trying to do. Changing the code to : ``` private def receiveLoop() { Executors.newSingleThreadExecutor(threadFactory).execute(new MessageLoop) } ``` does cause the `uncaughtException` to be invoked - as expected. If we catch the `Throwable` in `receiveLoop` before that, then obviously it is no longer an uncaught exception - and so `MyExceptionHandler.uncaughtException` wont be invoked. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org