[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42716336 Merged build finished. All automated tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42712525 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42799372 Looks good to me too ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/715#discussion_r12515148 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -259,19 +238,30 @@ private[spark] class Executor( } case t: Throwable = { - val serviceTime = System.currentTimeMillis() - taskStart - val metrics = attemptedTask.flatMap(t = t.metrics) - for (m - metrics) { -m.executorRunTime = serviceTime -m.jvmGCTime = gcTime - startGCTime - } - val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) - execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) + // Attempt to exit cleanly by informing the driver of our failure. + // If anything goes wrong (or this was a fatal exception), we will delegate to + // the default uncaught exception handler, which will terminate the Executor. + try { +logError(Exception in task ID + taskId, t) + +val serviceTime = System.currentTimeMillis() - taskStart +val metrics = attemptedTask.flatMap(t = t.metrics) +for (m - metrics) { + m.executorRunTime = serviceTime + m.jvmGCTime = gcTime - startGCTime +} +val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) +execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) - // TODO: Should we exit the whole executor here? On the one hand, the failed task may - // have left some weird state around depending on when the exception was thrown, but on - // the other hand, maybe we could detect that when future tasks fail and exit then. - logError(Exception in task ID + taskId, t) +// Don't forcibly exit unless the exception was inherently fatal, to avoid +// stopping other tasks unnecessarily. +if (Utils.isFatalError(t)) { + ExecutorUncaughtExceptionHandler.uncaughtException(t) +} + } catch { +case t2: Throwable = + ExecutorUncaughtExceptionHandler.uncaughtException(t2) --- End diff -- Hmm, good point. I kind of like being explicit over relying on the globally set uncaught exception handler. I could be happy with getting rid of this and replacing it with a comment, though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/715#discussion_r12515152 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -259,19 +238,30 @@ private[spark] class Executor( } case t: Throwable = { - val serviceTime = System.currentTimeMillis() - taskStart - val metrics = attemptedTask.flatMap(t = t.metrics) - for (m - metrics) { -m.executorRunTime = serviceTime -m.jvmGCTime = gcTime - startGCTime - } - val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) - execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) + // Attempt to exit cleanly by informing the driver of our failure. + // If anything goes wrong (or this was a fatal exception), we will delegate to + // the default uncaught exception handler, which will terminate the Executor. + try { +logError(Exception in task ID + taskId, t) + +val serviceTime = System.currentTimeMillis() - taskStart +val metrics = attemptedTask.flatMap(t = t.metrics) +for (m - metrics) { + m.executorRunTime = serviceTime + m.jvmGCTime = gcTime - startGCTime +} +val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) +execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) - // TODO: Should we exit the whole executor here? On the one hand, the failed task may - // have left some weird state around depending on when the exception was thrown, but on - // the other hand, maybe we could detect that when future tasks fail and exit then. - logError(Exception in task ID + taskId, t) +// Don't forcibly exit unless the exception was inherently fatal, to avoid +// stopping other tasks unnecessarily. +if (Utils.isFatalError(t)) { + ExecutorUncaughtExceptionHandler.uncaughtException(t) +} + } catch { +case t2: Throwable = + ExecutorUncaughtExceptionHandler.uncaughtException(t2) --- End diff -- Actually just realized we basically already have that comment, just interpreted in a different way :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42763346 /cc @scrapcodes who worked on the IndestructableActorSystem --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42716337 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14854/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/715#discussion_r12514419 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala --- @@ -71,7 +71,7 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String stopDaemon() startDaemon() new Socket(daemonHost, daemonPort) -case e: Throwable = throw e +case e: Exception = throw e --- End diff -- Do we really need this line ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42797560 Looks pretty good to me, just made one small comment. I think it's good to eliminate these now. I haven't seen many cases where they're super useful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42797519 It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing (here). I works but in case of OOMs, the behavior can be very sporadic. The only reason it was needed was in akka 2.0.x days netty was tolerating OOMs and it was thus never the *chance* that Akka got to deal with them. Since netty almost always had them and mostly manage to eat them. Very weird things. Just saying for posterity. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/715#discussion_r12514626 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -259,19 +238,30 @@ private[spark] class Executor( } case t: Throwable = { - val serviceTime = System.currentTimeMillis() - taskStart - val metrics = attemptedTask.flatMap(t = t.metrics) - for (m - metrics) { -m.executorRunTime = serviceTime -m.jvmGCTime = gcTime - startGCTime - } - val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) - execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) + // Attempt to exit cleanly by informing the driver of our failure. + // If anything goes wrong (or this was a fatal exception), we will delegate to + // the default uncaught exception handler, which will terminate the Executor. + try { +logError(Exception in task ID + taskId, t) + +val serviceTime = System.currentTimeMillis() - taskStart +val metrics = attemptedTask.flatMap(t = t.metrics) +for (m - metrics) { + m.executorRunTime = serviceTime + m.jvmGCTime = gcTime - startGCTime +} +val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) +execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) - // TODO: Should we exit the whole executor here? On the one hand, the failed task may - // have left some weird state around depending on when the exception was thrown, but on - // the other hand, maybe we could detect that when future tasks fail and exit then. - logError(Exception in task ID + taskId, t) +// Don't forcibly exit unless the exception was inherently fatal, to avoid +// stopping other tasks unnecessarily. +if (Utils.isFatalError(t)) { + ExecutorUncaughtExceptionHandler.uncaughtException(t) +} + } catch { +case t2: Throwable = + ExecutorUncaughtExceptionHandler.uncaughtException(t2) --- End diff -- Can't the uncaught exception handler for this thread be set to deal with this, instead of another catch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
GitHub user aarondav opened a pull request: https://github.com/apache/spark/pull/715 [RFC] SPARK-1772 Stop catching Throwable, let Executors die The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown. This patch also continues the fight in the neverending war against `case t: Throwable =`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs. It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be. [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR. You can merge this pull request into a Git repository by running: $ git pull https://github.com/aarondav/spark throwable Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/715.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #715 commit 1867867a00241ff1bd20d2ac3ac610ed126a9280 Author: Aaron Davidson aa...@databricks.com Date: 2014-05-09T20:28:26Z [RFC] SPARK-1772 Stop catching Throwable, let Executors die The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown. This patch also continues the fight in the neverending war against `case t: Throwable =`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs. It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be. [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/715#issuecomment-42712542 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---