[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-50443081 Sorry to come back to this after a while. Disk faults can be transient as well right? I'm not sure if we'd want to exit the executor simply because of one disk fault. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-50478348 @rxin Thank you for your comment. On second thought, this is not a good solution. I noticed that the root cause of this issue is that FetchFailedException is not thrown when a local fetch fails. https://github.com/apache/spark/pull/1578 may be a better solution. ---
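The alternative sarutak points to (surfacing a local fetch failure as a FetchFailedException, rather than killing the executor) can be sketched roughly as below. This is a simplified illustration, not Spark's actual code: the `FetchFailedException` class here is a stand-in, and `LocalFetch`/`fetchLocalBlock` are hypothetical names invented for the example.

```scala
import java.io.IOException

// Hypothetical stand-in for Spark's FetchFailedException, which tells the
// scheduler to regenerate the missing map output instead of letting the
// task fail repeatedly on the same executor.
class FetchFailedException(message: String, cause: Throwable)
  extends Exception(message, cause)

object LocalFetch {
  // Wrap a local block read so that a disk-level IOException surfaces as a
  // fetch failure the scheduler can react to, rather than an unclassified
  // error that leaves tasks retrying forever.
  def fetchLocalBlock(blockId: String)(read: String => Array[Byte]): Array[Byte] =
    try read(blockId)
    catch {
      case e: IOException =>
        throw new FetchFailedException(s"Failed to fetch local block $blockId", e)
    }
}
```

With this shape, a missing or unreadable bucket file becomes an ordinary fetch failure, so the consuming task can retry the fetch from another executor, which matches the behavior described in the PR body.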
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-50506336 Thanks - do you mind closing this one? ---
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-50506697 OK. Instead, please watch this PR: https://github.com/apache/spark/pull/1578 . It may be a solution for this issue. ---
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-49379957 @rxin, I noticed some issues related to this one. In the following 3 situations, which may be caused by disk faults, the executor doesn't stop, so tasks assigned to that executor always fail. Should we exit the executor in those situations? ---
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-49255082 My PR handles IOException as fatal, but I think that's not good because IOException is not always fatal. The problem I want to solve is the IOException thrown when writing to or reading from the local directory managed by DiskBlockManager fails. In that case, it's almost always caused by a disk fault. So, is it better to exit only when reading from or writing to the local directory fails? ---
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-49102405 Thanks for submitting this. Is there any way we can construct a unit test for this as well? ---
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1383#discussion_r14968203 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -1223,6 +1223,8 @@ private[spark] object Utils extends Logging { /** Returns true if the given exception was fatal. See docs for scala.util.control.NonFatal. */ def isFatalError(e: Throwable): Boolean = { e match { + case _: IOException => true --- End diff -- Can you add some inline comment explaining why we are catching this IOException here? ---
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-49104351 OK. I will add a comment for my change, and I will also add a test case for this issue to FailureSuite.scala. Is that appropriate? ---
GitHub user sarutak opened a pull request: https://github.com/apache/spark/pull/1383 [SPARK-1667] Jobs never finish successfully once bucket file missing occurred

If jobs execute a shuffle, bucket files are created in a temporary directory (named like spark-local-*). When the bucket files go missing, caused by disk failure or other reasons, jobs cannot execute a shuffle that has the same shuffle id as the bucket files. I think that when executors cannot read bucket files from their local directory (spark-local-*), they should abort and be marked as lost. In this case, the executor which has the bucket files throws FileNotFoundException, so I think we should handle IOException as fatal in Utils.scala in order to abort. After I modified the code as follows, an executor which fetches bucket files from the failed executor could retry the fetch from another executor.

    def isFatalError(e: Throwable): Boolean = {
      e match {
        case _: IOException => true
        case NonFatal(_) | _: InterruptedException | _: NotImplementedError | _: ControlThrowable => false
        case _ => true
      }
    }

You can merge this pull request into a Git repository by running: $ git pull https://github.com/sarutak/spark SPARK-1667 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1383.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1383

commit 1fc1ba05d21d9732e8c80273d13792a2879c8f1f Author: Kousuke Saruta saru...@oss.nttdata.co.jp Date: 2014-07-12T01:08:26Z Modify for SPARK-1667

commit c2044d6a53e7767b89c46f92ad5cec4635602bac Author: Kousuke Saruta saru...@oss.nttdata.co.jp Date: 2014-07-12T02:31:55Z Modified Utils.scala to handle IOException as fatal ---
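For reference, the classification proposed in the PR description can be written as a self-contained sketch. This assumes the usual imports from java.io and scala.util.control; the object name ErrorClassifier is invented for illustration (in Spark the method lives on the Utils object):

```scala
import java.io.IOException
import scala.util.control.{ControlThrowable, NonFatal}

object ErrorClassifier {
  /** Returns true if the given exception should be treated as fatal. */
  def isFatalError(e: Throwable): Boolean = e match {
    // Proposed change: treat disk I/O failures as fatal so the executor
    // exits and is marked lost, letting its tasks be rescheduled elsewhere.
    case _: IOException => true
    // NonFatal covers ordinary exceptions; the extra alternatives are
    // control-flow or interruption signals that should not kill the JVM.
    case NonFatal(_) | _: InterruptedException | _: NotImplementedError |
         _: ControlThrowable => false
    // Everything else (e.g. OutOfMemoryError) is fatal.
    case _ => true
  }
}
```

As rxin notes in the discussion, this classification is blunt: an IOException can be transient, so exiting the executor on every disk fault may be too aggressive, which is why the PR was ultimately closed in favor of #1578.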
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1383#issuecomment-48800479 Can one of the admins verify this patch? ---