[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-50443081
  
Sorry to come back to this after a while. Disk faults can be transient as 
well, right? I'm not sure we'd want to exit the executor simply because of 
one disk fault. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-29 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-50478348
  
@rxin Thank you for your comment.
On second thought, that is not a good solution. I noticed that the root cause 
of this issue is that FetchFailedException is not thrown when a local fetch 
fails.
https://github.com/apache/spark/pull/1578 may be a better solution.
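
The idea behind #1578 can be sketched roughly like this (a hypothetical
illustration, not the actual patch; `BlockId`, `FetchFailedException`, and
`getLocalBlock` here are simplified stand-ins for Spark's real types):

```scala
import java.io.IOException

// Simplified stand-ins for Spark's real types, for illustration only.
case class BlockId(name: String)
class FetchFailedException(msg: String, cause: Throwable) extends Exception(msg, cause)

// Wrap a local block read so that an I/O failure surfaces as a fetch
// failure the scheduler can react to (by regenerating the map output),
// instead of an opaque IOException that just fails the task.
def getLocalBlock(id: BlockId, read: BlockId => Array[Byte]): Array[Byte] =
  try {
    read(id)
  } catch {
    case e: IOException =>
      throw new FetchFailedException(s"Local fetch of ${id.name} failed", e)
  }
```

With this shape, a missing or unreadable local bucket file is reported the
same way as a failed remote fetch, so the scheduler can resubmit the map
stage rather than retrying a doomed task forever.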


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-50506336
  
Thanks - do you mind closing this one?


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-29 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-50506697
  
OK. Instead, please watch this PR: https://github.com/apache/spark/pull/1578.
It may be a solution for this issue.


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-17 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-49379957
  
@rxin, I noticed some issues related to this one.
In the following three situations, each of which may be caused by a disk 
fault, the executor doesn't stop, so tasks assigned to that executor always 
fail.

Should we exit the executor in those situations? 


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-16 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-49255082
  
My PR handles IOException as fatal, but I think that's not good because 
IOException is not always fatal.
The problem I want to solve is the IOException thrown when writing to or 
reading from the local directory managed by DiskBlockManager fails.
In such cases, the failure is almost always caused by a disk fault.
So, is it better to exit only when reading from or writing to the local 
directory fails?
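
The narrower rule described here could be sketched as follows (an
illustration under assumptions: `isUnderLocalDirs` and `isLikelyDiskFault`
are hypothetical helpers, not Spark's actual API):

```scala
import java.io.{File, IOException}

// Hypothetical helper: does the file live under one of the local
// directories that DiskBlockManager manages (the spark-local-* dirs)?
def isUnderLocalDirs(file: File, localDirs: Seq[File]): Boolean = {
  val path = file.getCanonicalPath
  localDirs.exists(dir => path.startsWith(dir.getCanonicalPath + File.separator))
}

// Only an IOException raised for a file under a managed local dir is
// taken as a likely disk fault; any other IOException stays non-fatal.
def isLikelyDiskFault(e: Throwable, file: File, localDirs: Seq[File]): Boolean =
  e.isInstanceOf[IOException] && isUnderLocalDirs(file, localDirs)
```

This keeps ordinary application I/O errors (reading user input files, for
example) non-fatal while still letting shuffle-directory failures kill the
executor.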



[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-49102405
  
Thanks for submitting this. Is there any way we can construct a unit test 
for this as well?


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1383#discussion_r14968203
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -1223,6 +1223,8 @@ private[spark] object Utils extends Logging {
   /** Returns true if the given exception was fatal. See docs for 
scala.util.control.NonFatal. */
   def isFatalError(e: Throwable): Boolean = {
 e match {
+  case _: IOException =>
--- End diff --

Can you add some inline comment explaining why we are catching this 
IOException here?


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-15 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-49104351
  
OK. I will add a comment for my change.
I will also add a test case for this issue to FailureSuite.scala. Is that 
appropriate?
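
A minimal, self-contained sketch of what such a test could assert (plain
assertions here rather than the ScalaTest style FailureSuite actually uses,
with the patched isFatalError copied inline):

```scala
import java.io.IOException
import scala.util.control.{ControlThrowable, NonFatal}

// Inline copy of the patched isFatalError, for testing purposes only.
def isFatalError(e: Throwable): Boolean = {
  e match {
    case _: IOException =>
      true  // new behavior under this PR: treat I/O errors as fatal
    case NonFatal(_) | _: InterruptedException | _: NotImplementedError |
         _: ControlThrowable =>
      false
    case _ =>
      true
  }
}
```

The test would then check that IOException is newly fatal, that ordinary
application exceptions remain non-fatal, and that errors like
OutOfMemoryError are still fatal as before.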


[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-11 Thread sarutak
GitHub user sarutak opened a pull request:

https://github.com/apache/spark/pull/1383

[SPARK-1667] Jobs never finish successfully once bucket file missing 
occurred

If a job executes a shuffle, bucket files are created in a temporary 
directory (named like spark-local-*).
When the bucket files go missing because of a disk failure or any other 
reason, jobs can no longer execute a shuffle that has the same shuffle ID as 
those bucket files.

I think that when executors cannot read bucket files from their local 
directory (spark-local-*), they should abort and be marked as lost.
In this case, the executor that has the bucket files throws 
FileNotFoundException, so I think we should handle IOException as fatal in 
Utils.scala so that the executor aborts.

After I modified the code as follows, an executor fetching bucket files from 
a failed executor could retry the fetch from another executor.


def isFatalError(e: Throwable): Boolean = {
  e match {
    case _: IOException =>
      true
    case NonFatal(_) | _: InterruptedException | _: NotImplementedError |
         _: ControlThrowable =>
      false
    case _ =>
      true
  }
}
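
To show how such a check would be consumed, here is an illustrative sketch
(not Spark's actual executor code; `runTask`, `isFatal`, and `shutdown` are
hypothetical names) of a task runner that reports the task failure but also
shuts the whole executor down when the error is fatal:

```scala
// Run a task body; on failure, report the error and, if it is fatal,
// also trigger an executor-wide shutdown so the master marks it lost.
def runTask(body: () => Unit,
            isFatal: Throwable => Boolean,
            shutdown: () => Unit): Option[Throwable] =
  try { body(); None }
  catch {
    case t: Throwable =>
      if (isFatal(t)) shutdown()
      Some(t)
  }
```

Once the executor is marked lost, other executors stop trying to fetch
bucket files from it and can retry against a healthy one.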


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sarutak/spark SPARK-1667

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1383


commit 1fc1ba05d21d9732e8c80273d13792a2879c8f1f
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Date:   2014-07-12T01:08:26Z

Modify for SPARK-1667

commit c2044d6a53e7767b89c46f92ad5cec4635602bac
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Date:   2014-07-12T02:31:55Z

Modified Utils.scala to handle IOException as fatal




[GitHub] spark pull request: [SPARK-1667] Jobs never finish successfully on...

2014-07-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1383#issuecomment-48800479
  
Can one of the admins verify this patch?

