[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...

2016-06-02 Thread tedyu
Github user tedyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13473#discussion_r65649617
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -403,6 +403,17 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Cleanup code run in response to a failed local read.
+   * Must be called while holding a read lock on the block.
+   */
+  private def handleLocalReadFailure(blockId: BlockId): Nothing = {
+    releaseLock(blockId)
+    // Remove the missing block so that its unavailability is reported to the driver
+    removeBlock(blockId)
--- End diff --

Looking at BlockInfoManager#lockForWriting(), I think you're right.




[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...

2016-06-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13473#discussion_r65648664
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -403,6 +403,17 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Cleanup code run in response to a failed local read.
+   * Must be called while holding a read lock on the block.
+   */
+  private def handleLocalReadFailure(blockId: BlockId): Nothing = {
+    releaseLock(blockId)
+    // Remove the missing block so that its unavailability is reported to the driver
+    removeBlock(blockId)
--- End diff --

No, I don't think so: internally, `removeBlock` acquires a write lock on 
the block, so if we called it before the `releaseLock` call then we'd be 
calling it while holding a read lock which would cause us to deadlock.
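
The ordering matters because BlockInfoManager's per-block locks, much like
java.util.concurrent's ReentrantReadWriteLock, cannot be upgraded from read to
write. A small standalone Scala sketch of that behavior (an analogy only, not
Spark's BlockInfoManager code):

    import java.util.concurrent.TimeUnit
    import java.util.concurrent.locks.ReentrantReadWriteLock

    object LockUpgradeDemo {
      def main(args: Array[String]): Unit = {
        val lock = new ReentrantReadWriteLock()
        lock.readLock().lock()  // analogous to the read lock held during the failed read
        // Taking the write lock while still holding the read lock never succeeds:
        println(lock.writeLock().tryLock(1, TimeUnit.SECONDS))  // false
        lock.readLock().unlock()  // analogous to releaseLock(blockId)
        // Once the read lock is released, the write lock (cf. removeBlock) is available:
        println(lock.writeLock().tryLock(1, TimeUnit.SECONDS))  // true
        lock.writeLock().unlock()
      }
    }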




[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...

2016-06-02 Thread tedyu
Github user tedyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13473#discussion_r65648116
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -403,6 +403,17 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Cleanup code run in response to a failed local read.
+   * Must be called while holding a read lock on the block.
+   */
+  private def handleLocalReadFailure(blockId: BlockId): Nothing = {
+    releaseLock(blockId)
+    // Remove the missing block so that its unavailability is reported to the driver
+    removeBlock(blockId)
--- End diff --

Should this be called before the releaseLock() call ?




[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...

2016-06-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13473




[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...

2016-06-02 Thread JoshRosen
GitHub user JoshRosen opened a pull request:

https://github.com/apache/spark/pull/13473

[SPARK-15736][CORE] Gracefully handle loss of DiskStore files

If an RDD partition is cached on disk and the DiskStore file is lost, then 
reads of that cached partition will fail and the missing partition is supposed 
to be recomputed by a new task attempt. However, the current behavior is to 
repeatedly re-attempt the read on the same machine without performing any 
recomputation, which leads to a complete job failure.

In order to fix this problem, the executor with the missing file needs to 
properly mark the corresponding block as missing so that it stops advertising 
itself as a cache location for that block.
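
For reference, a compilable sketch of the cleanup helper this patch adds (the
stub trait stands in for BlockManager, and the exception message is
illustrative rather than the exact text in the patch):

    import org.apache.spark.SparkException
    import org.apache.spark.storage.BlockId

    trait LocalReadFailureHandling {
      def releaseLock(blockId: BlockId): Unit
      def removeBlock(blockId: BlockId): Unit

      /** Cleanup run when a block that should be readable locally turns out to be missing. */
      def handleLocalReadFailure(blockId: BlockId): Nothing = {
        // Drop our read lock first; removeBlock acquires the write lock internally.
        releaseLock(blockId)
        // Remove the missing block so that its unavailability is reported to the driver
        // and this executor stops advertising itself as a cache location for it.
        removeBlock(blockId)
        throw new SparkException(s"Block $blockId was not found even though it should be present")
      }
    }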

This patch fixes the bug and adds an end-to-end regression test (in 
`FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).
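
A minimal sketch of the kind of end-to-end check the regression test performs,
assuming it lives in the org.apache.spark package (so block manager internals
are visible) and runs against a local-mode SparkContext with one task retry
allowed; it is illustrative, not the actual FailureSuite code:

    package org.apache.spark

    import org.apache.spark.storage.StorageLevel

    object MissingDiskStoreFileRepro {
      def main(args: Array[String]): Unit = {
        // "local[1,2]" = one worker thread, up to two attempts per task.
        val sc = new SparkContext(new SparkConf().setMaster("local[1,2]").setAppName("repro"))
        try {
          val rdd = sc.parallelize(1 to 100, 4).persist(StorageLevel.DISK_ONLY)
          assert(rdd.count() == 100)  // materialize the on-disk cache
          // Simulate loss of the DiskStore files backing the cached partitions.
          SparkEnv.get.blockManager.diskBlockManager.getAllFiles().foreach(_.delete())
          // With the fix, the first read attempt fails, the block is reported missing,
          // and the retried tasks recompute the partitions instead of failing the job.
          assert(rdd.count() == 100)
        } finally {
          sc.stop()
        }
      }
    }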

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark handle-missing-cache-files

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13473.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13473


commit 92bfce821c998c0b1a5b7ab3b674d0e59a7c0033
Author: Josh Rosen 
Date:   2016-06-02T20:30:42Z

Add failing regression test.

commit a104ac534be3df927a21aeb42fe6c11e1cd03636
Author: Josh Rosen 
Date:   2016-06-02T21:08:43Z

Add failing unit tests in BlockManagerSuite.

commit e26b2f6aa0571392730338960666ca66f5901f35
Author: Josh Rosen 
Date:   2016-06-02T21:09:02Z

Fix actual bug.

commit 0c51bb63e979549f2fad4fc14767ac6a19952e8c
Author: Josh Rosen 
Date:   2016-06-02T21:21:22Z

Update test names.



