[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-27 Thread GitBox


mridulm commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1259029762

   Can you take a look at comment above @yabola and work on the fix ? Since you 
already spent a lot of time on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-26 Thread GitBox


mridulm commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1259007734

   Thanks for the query @Ngone51 - I missed out one aspect of my analysis, 
which ends up completely changing the solution - my bad :-(
   
   The answer to your query has the reason for the lack of failure - this is 
due to the two types of api's we are using ...
   
   For `DeducatedMessageLoop`, for the initial submission of 
`receiveLoopRunnable` - we use `ExecutorService.submit` api - while for other 
cases, we use the `Executor.execute` api - and this is the cause for the 
behavior.
   
   The `submit` api returns a `Future` - and if we look at the implementation 
of `FutureTask.run`, we see that it catches `Throwable` and preserves that as 
the outcome (`setException`) - which is the reason why the thread itself does 
not die.
   
   
   So the specific case might be mitigated by using `execute` instead of 
`submit` api.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-26 Thread GitBox


mridulm commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1258716473

   Added a few debug statements, and it became clear what the issue is.
   Essentially, since we are leveraging a `ThreadPoolExecutor`, it does not 
result in killing the thread with the exception/error thrown - but rather, will 
call `ThreadPoolExecutor.afterExecute` with the cause for failure (See 
`runWorker` for more).
   
   We should be overriding this, and invoke our `uncaughtExceptionHandler` when 
an exception is thrown.
   
   
   In `receiveLoopRunnable` when a `Throwable` is thrown:
   
   ```
   22/09/26 17:17:12 INFO DedicatedMessageLoop: Current exceptionHandler = 
org.apache.spark.util.SparkUncaughtExceptionHandler@27c71f14
   22/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = 
Thread[dispatcher-Executor,5,main]
   22/09/26 17:17:12 INFO DedicatedMessageLoop: Stack ...
   java.lang.Exception: For stack
   at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:56)
   at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   at java.base/java.lang.Thread.run(Thread.java:829)
   
   ```
   
   In `receiveLoopRunnable`'s `run`, when a `Throwable` is thrown:
   ```
   2/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = 
Thread[dispatcher-Executor,5,main], stackTrace =
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.lang.Thread.dumpThreads(Native Method)
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.lang.Thread.getAllStackTraces(Thread.java:1653)
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$dumpAllStackTraces(MessageLoop.scala:70)
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:58)
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   22/09/26 17:17:12 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   
   ```
   
   
   and finally, a few seconds after Executor Inbox failure - dumping all 
threads in a new thread.
   ```
   22/09/26 17:17:14 INFO DedicatedMessageLoop: Thread = 
Thread[dispatcher-Executor,5,main], stackTrace =
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
java.base@11.0.16/jdk.internal.misc.Unsafe.park(Native Method)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:102)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:45)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   22/09/26 17:17:14 INFO DedicatedMessageLoop: 
java.base@11.0.16/java.lang.Thread.run(Thread.java:829)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-26 Thread GitBox


mridulm commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1258427841

   I agree @Ngone51, it should have :-) I am trying to reproduce this locally 
and see if we are missing something - or there is some nuance behind the 
observed behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-23 Thread GitBox


mridulm commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1255831313

   I will try to reproduce this locally - is there anything else specific I 
need to know to do so ? Or is the instructions in SPARK-40320 sufficient ? Thx


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-22 Thread GitBox


mridulm commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1255648587

   A few points:
   
   * Plugins are initialized as part of `Executor` construction, but after 
uncaught exception handler is set.
   * If anything other than `NonFatal` is thrown in `receive` (like the 
`LinkageError` from @yabola's example here), it will be handled in 
`Inbox.process` as part of `safelyCall` - which will result in re-throwing it.
   * This now bubbles up to `receiveLoop` - where we end up throwing the 
throwable again (after scheduling it for re-execution).
   * The throw should result in uncaught exception handler killing the jvm - 
and if it does not, then the re-enqueue in prev step will cause the message to 
be reprocessed.
 * In this case we are discussing, it should have resulted in killing the 
VM.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error

2022-09-21 Thread GitBox


mridulm commented on PR #37779:
URL: https://github.com/apache/spark/pull/37779#issuecomment-1254457075

   I am not sure I follow what the code snippet is trying to do.
   Changing the code to :
   
   ```
 private def receiveLoop() {
 Executors.newSingleThreadExecutor(threadFactory).execute(new 
MessageLoop)
   }
   ```
   does cause the `uncaughtException` to be invoked - as expected.
   If we catch the `Throwable` in `receiveLoop` before that, then obviously it 
is no longer an uncaught exception - and so 
`MyExceptionHandler.uncaughtException` wont be invoked.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org