[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-03 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137572073
  
LGTM merging into master, thanks everyone.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137372105
  
Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137372017
  
  [Test build #41960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41960/console) for PR 7927 at commit [`6c6d53d`](https://github.com/apache/spark/commit/6c6d53d61d293e2c14ed3f776b8fb20397a6b332).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BlockFetchException(messages: String, throwable: Throwable)`



[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137372107
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41960/
Test PASSed.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7927


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137333165
  
 Merged build triggered.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137333173
  
Merged build started.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137333723
  
  [Test build #41960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41960/consoleFull) for PR 7927 at commit [`6c6d53d`](https://github.com/apache/spark/commit/6c6d53d61d293e2c14ed3f776b8fb20397a6b332).


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-02 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137332857
  
retest this please


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-02 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137046713
  
It seems that the failure is not related.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136920246
  
  [Test build #41907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41907/consoleFull) for PR 7927 at commit [`6c6d53d`](https://github.com/apache/spark/commit/6c6d53d61d293e2c14ed3f776b8fb20397a6b332).


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136883210
  
ok to test


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r38482388
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,25 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var attemptTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case t: Throwable if attemptTimes < locations.size - 1 =>
+          // Return null when Exception throw, so we can fetch block
+          // from another location if there still have locations
+          attemptTimes += 1
+          logWarning(s"Try $attemptTimes times getting remote block $blockId from $loc failed.", t)
+          null
+        case t: Throwable =>
+          // Throw BlockFetchException wraps the last Exception when
+          // there is no block we can fetch
+          throw new BlockFetchException(s"Failed to fetch block from" +
+            s" ${locations.size} locations. Most recent failure cause:", t)
--- End diff --

I think it's cleaner to merge these two cases:
```
} catch {
  case NonFatal(e) =>
    numFetchFailures += 1
    if (numFetchFailures == locations.size) {
      throw new BlockFetchException("...", e)
    } else {
      logWarning(s"Failed to fetch remote block $blockId from $loc (failed attempt $numFetchFailures)", e)
      // This location failed, so we retry fetch from a different location by returning null here
      null
    }
}
```
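
For context, here is a sketch of how the whole retry loop could read with that suggestion applied (an illustration only: the counter name, exception message, and the `NonFatal` import are assumptions, not necessarily the code as merged):

```scala
import scala.util.control.NonFatal

val locations = Random.shuffle(master.getLocations(blockId))
var numFetchFailures = 0  // assumed counter name, reset on each call
for (loc <- locations) {
  logDebug(s"Getting remote block $blockId from $loc")
  val data = try {
    blockTransferService.fetchBlockSync(
      loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
  } catch {
    case NonFatal(e) =>
      numFetchFailures += 1
      if (numFetchFailures == locations.size) {
        // This was the last candidate location, so give up and surface
        // the most recent cause to the caller.
        throw new BlockFetchException(
          s"Failed to fetch block from ${locations.size} locations. " +
            "Most recent failure cause:", e)
      } else {
        // This location failed; returning null lets the loop fall through
        // and retry the fetch from the next location.
        logWarning(s"Failed to fetch remote block $blockId from $loc " +
          s"(failed attempt $numFetchFailures)", e)
        null
      }
  }
  // ... handle a non-null `data` as before ...
}
```

Merging the cases this way also means a single guard decides between "retry" and "give up", instead of two `Throwable` patterns that must stay in sync.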


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136919718
  
 Merged build triggered.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136919732
  
Merged build started.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136911190
  
  [Test build #41899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41899/console) for PR 7927 at commit [`75db334`](https://github.com/apache/spark/commit/75db3340384def0b476482d9d9a8ff90f7b5e9f9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BlockFetchException(messages: String, throwable: Throwable)`



[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136911264
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41899/
Test PASSed.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136911263
  
Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r38481057
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,25 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var attemptTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case t: Throwable if attemptTimes < locations.size - 1 =>
--- End diff --

please use `scala.util.control.NonFatal(e)` instead. We don't want to catch OOMs here.
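
For reference, `scala.util.control.NonFatal` is an extractor that deliberately does not match fatal errors (`OutOfMemoryError` and other `VirtualMachineError`s, `InterruptedException`, `LinkageError`, and control-flow throwables). A minimal sketch of how it slots into the catch block above (the guard and counter name are assumptions for illustration):

```scala
import scala.util.control.NonFatal

val data = try {
  blockTransferService.fetchBlockSync(
    loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
  // Fatal errors such as OutOfMemoryError are not matched by NonFatal,
  // so they propagate instead of being swallowed by the retry logic.
  case NonFatal(e) if numFetchAttempts < locations.size - 1 =>
    numFetchAttempts += 1
    logWarning(s"Failed to fetch remote block $blockId from $loc", e)
    null
}
```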


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r38482018
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,25 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var attemptTimes = 0
--- End diff --

can you call this `numFetchAttempts`


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136887215
  
@jeanlyn can you rename the title of this patch and the issue to remove 
references to "broadcast"? I believe this is applicable to all blocks in 
general.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136884719
  
  [Test build #41899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41899/consoleFull) for PR 7927 at commit [`75db334`](https://github.com/apache/spark/commit/75db3340384def0b476482d9d9a8ff90f7b5e9f9).


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136883648
  
 Merged build triggered.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136883680
  
Merged build started.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136942750
  
Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136942619
  
  [Test build #41907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41907/console) for PR 7927 at commit [`6c6d53d`](https://github.com/apache/spark/commit/6c6d53d61d293e2c14ed3f776b8fb20397a6b332).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BlockFetchException(messages: String, throwable: Throwable)`



[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-136942752
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41907/
Test FAILed.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-09 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36587468
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,24 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var attemptTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case t: Throwable if attemptTimes < locations.size - 1 =>
+          // Return null when Exception throw, so we can fetch block
+          // from another location if there still have locations
+          attemptTimes += 1
+          logWarning(s"Try $attemptTimes times getting remote block $blockId from $loc failed.", t)
+          null
+        case t: Throwable =>
+          // Throw BlockFetchException wraps the last Exception when
+          // there is no block we can fetch
+          throw new BlockFetchException(t)
--- End diff --

Sure!


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-09 Thread GraceH
Github user GraceH commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36596179
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,21 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var failTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: IOException if failTimes < locations.size - 1 =>
+          // Return null when IOException throw, so we can fetch block
+          // from another location if there still have locations
+          failTimes += 1
+          logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e)
--- End diff --

That's fine @squito. Both option (a) and (b) are acceptable.

BTW, we'd better add some documentation to tell the caller that this method throws an exception. In Java, all expected exceptions are declared explicitly, so it is easy to see which kinds of exceptions the caller should catch. Most callers of ```bm.getRemoteBytes``` or ```bm.get``` seem to rely on the returned ```Option``` and are not aware of any exception. We should tell them that if getting the block fails from all remotes, a new ```Exception``` is expected to be thrown, as the unit test does.
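
As a concrete illustration of the kind of documentation being suggested, a hypothetical Scaladoc sketch (the wording and the exact contract are assumptions, not text from this PR):

```scala
/**
 * Get the remote block bytes from one of the locations registered with the master.
 *
 * @return Some(bytes) if the block was fetched from any location, or None if the
 *         master knows of no locations for this block.
 * @throws BlockFetchException if the block has known locations but the fetch
 *         failed from every one of them; the most recent failure is the cause.
 */
def getRemoteBytes(blockId: BlockId): Option[ByteBuffer]
```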


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-07 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36556494
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,21 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var failTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: IOException if failTimes < locations.size - 1 =>
+          // Return null when IOException throw, so we can fetch block
+          // from another location if there still have locations
+          failTimes += 1
+          logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e)
--- End diff --

Hi @GraceH, sorry I didn't see your comment earlier. You are right: catching just `IOException` doesn't make sense; no matter what the exception is, we should keep trying all locations.

I think we are in agreement that `doGetRemote` should tolerate exceptions, as long as it is able to get the block from one location (and all those exceptions should get logged).

The question is, what should it do when it cannot fetch the block from any location? Should it (a) return `None` and just log all the exceptions, or (b) throw an exception? In some ways, it seems the answer should be (a), since the API already returns an `Option`, and that would seem to be its whole purpose. But on the other hand, the only way the existing code could return `None` is if `locations` were empty -- the actual behavior of the current version is to throw an exception if the block isn't available.

I'm really torn between the two options; I find I can convince myself of either one. Ultimately I am leaning slightly toward (b), just because we can get a slightly better exception.
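
A minimal sketch of option (b), assuming the last exception is remembered outside the loop (names, message text, and the `NonFatal` guard are placeholders for illustration):

```scala
import scala.util.control.NonFatal

var lastException: Throwable = null
for (loc <- locations) {
  try {
    // A single successful fetch is enough, so return immediately.
    return Some(blockTransferService.fetchBlockSync(
      loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer())
  } catch {
    case NonFatal(e) =>
      // Remember the most recent failure and keep trying the other locations.
      lastException = e
      logWarning(s"Failed to fetch block $blockId from $loc", e)
  }
}
// Option (b): every location failed, so surface the last cause instead of None.
throw new BlockFetchException(
  s"Failed to fetch block from ${locations.size} locations. " +
    "Most recent failure cause:", lastException)
```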


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-07 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36556522
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockFetchException.scala ---
@@ -0,0 +1,21 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+private[spark]
+case class BlockFetchException(throwable: Throwable) extends Exception(throwable)
--- End diff --

this should extend `SparkException`
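
Roughly, and consistent with the public class later reported by the test builds, that could look like the following (`SparkException(message, cause)` is the existing two-argument constructor in `org.apache.spark`):

```scala
package org.apache.spark.storage

import org.apache.spark.SparkException

private[spark]
case class BlockFetchException(messages: String, throwable: Throwable)
  extends SparkException(messages, throwable)
```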


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-07 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36557012
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,24 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var attemptTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case t: Throwable if attemptTimes < locations.size - 1 =>
+          // Return null when Exception throw, so we can fetch block
+          // from another location if there still have locations
+          attemptTimes += 1
+          logWarning(s"Try $attemptTimes times getting remote block $blockId from $loc failed.", t)
+          null
+        case t: Throwable =>
+          // Throw BlockFetchException wraps the last Exception when
+          // there is no block we can fetch
+          throw new BlockFetchException(t)
--- End diff --

can we include a little more info here, e.g. `s"failed to fetch block from ${locations.size} locations. Most recent failure cause:"`


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-07 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36556866
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

yes, agreed. Sorry for seeing this so late (I commented a little more down below).


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36292639
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,37 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "excutor1")

executor


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36292648
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,37 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "excutor1")
+      store2 = makeBlockManager(8000, "excutor2")
--- End diff --

executor


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36293219
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,21 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var failTimes = 1
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: IOException if failTimes < locations.size =>
+          // return null when IOException throw  ,so we can fetch block
--- End diff --

`return null when IOException throw, so we can fetch block`


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36294518
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,37 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "excutor1")
--- End diff --

My bad. 


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36297450
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,21 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var failTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: IOException if failTimes < locations.size - 1 =>
+          // Return null when IOException throw, so we can fetch block
+          // from another location if there still have locations
+          failTimes += 1
+          logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e)
--- End diff --

I wonder if we should save the last exception, and throw it if we don't find the block in any of the locations. That would be more consistent with the old behavior (throw an exception), but then again, the API returns an `Option`, so maybe we should just swallow all exceptions and return `None` as you have here. Also, now I see what Sean meant earlier when he suggested pushing this down into `fetchBlockSync`. It looks as if this method expects `fetchBlockSync` to return null when it encounters an error, so maybe we should just change the behavior there after all. @aarondav any thoughts? I'm leaning towards pushing it into `fetchBlockSync` -- have it swallow the exception and return null. (This is the only use of `fetchBlockSync`.)
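
For comparison, a sketch of that alternative: swallowing the failure near `fetchBlockSync` and returning null. This is purely hypothetical (the wrapper name is made up, and the thread's later diffs show the catch staying in `doGetRemote` instead); the signature mirrors `BlockTransferService.fetchBlockSync`:

```scala
import scala.util.control.NonFatal

// Hypothetical wrapper; `fetchBlockSync` is the existing throwing API.
def fetchBlockSyncOrNull(host: String, port: Int, execId: String, blockId: String): ManagedBuffer = {
  try {
    fetchBlockSync(host, port, execId, blockId)
  } catch {
    case NonFatal(e) =>
      // Swallow the failure so that doGetRemote can treat a null result as
      // "this location did not work" and simply move on to the next one.
      logWarning(s"Failed to fetch block $blockId from $host:$port", e)
      null
  }
}
```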


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36297624
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,21 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var failTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: IOException if failTimes < locations.size - 1 =>
+          // Return null when IOException throw, so we can fetch block
+          // from another location if there still have locations
+          failTimes += 1
+          logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e)
+          null
+        case t: Throwable => throw t
--- End diff --

you can delete this -- uncaught exceptions are automatically thrown
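
That is, assuming nothing else needs to happen in the fatal case, the catch block can simply omit the rethrowing pattern:

```scala
} catch {
  case e: IOException if failTimes < locations.size - 1 =>
    failTimes += 1
    logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e)
    null
  // No `case t: Throwable => throw t` needed: anything not matched above
  // propagates out of the try expression unchanged.
}
```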


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36297953
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,34 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "executor1")
+      store2 = makeBlockManager(8000, "executor2")
+      store3 = makeBlockManager(8000, "executor3")
+      val list1 = List(new Array[Byte](4000))
+      store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      store3.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      var list1Get = store.getRemoteBytes("list1")
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      // block manager exit
+      store2.stop()
+      store2 = null
+      list1Get = store.getRemoteBytes("list1")
+      // get `list1` block
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      store3.stop()
+      store3 = null
+      // exception throw because there is no locations
+      intercept[java.io.IOException] {
+        list1Get = store.getRemoteBytes("list1")
+      }
+    } finally {
+      conf.remove("spark.network.timeout")
--- End diff --

just in case somebody changes this test suite to use a specific network timeout, I think you should restore it to the previous value, if there was one.

```scala
val origTimeoutOpt = conf.getOption("spark.network.timeout")
try {
 ...
} finally {
  origTimeoutOpt match {
    case Some(t) => conf.set("spark.network.timeout", t)
    case None => conf.remove("spark.network.timeout")
  }
}
```


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36298102
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,34 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "executor1")
+      store2 = makeBlockManager(8000, "executor2")
+      store3 = makeBlockManager(8000, "executor3")
+      val list1 = List(new Array[Byte](4000))
+      store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      store3.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      var list1Get = store.getRemoteBytes("list1")
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      // block manager exit
+      store2.stop()
+      store2 = null
+      list1Get = store.getRemoteBytes("list1")
+      // get `list1` block
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      store3.stop()
+      store3 = null
+      // exception throw because there is no locations
+      intercept[java.io.IOException] {
+        list1Get = store.getRemoteBytes("list1")
+      }
--- End diff --

does this test pass? I thought you were now just logging the exceptions and returning `None` in this case? (Which might be fine, given my questions above on how we should change the API ...)


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-127991678
  
thanks for updating @jeanlyn .  Sorry that I didn't fully understand the 
issue earlier and for potentially changing the desired outcome on you.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36302564
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,34 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "executor1")
+      store2 = makeBlockManager(8000, "executor2")
+      store3 = makeBlockManager(8000, "executor3")
+      val list1 = List(new Array[Byte](4000))
+      store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      store3.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      var list1Get = store.getRemoteBytes("list1")
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      // block manager exit
+      store2.stop()
+      store2 = null
+      list1Get = store.getRemoteBytes("list1")
+      // get `list1` block
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      store3.stop()
+      store3 = null
+      // exception throw because there is no locations
+      intercept[java.io.IOException] {
+        list1Get = store.getRemoteBytes("list1")
+      }
--- End diff --

Yes, the test passes on my local laptop. I only catch `IOException` when there are still locations from which we can fetch the block.


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread GraceH
Github user GraceH commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36372781
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -590,10 +590,21 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var failTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: IOException if failTimes < locations.size - 1 =>
+          // Return null when IOException throw, so we can fetch block
+          // from another location if there still have locations
+          failTimes += 1
+          logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e)
--- End diff --

@squito We'd better catch the exception so that the working flow is not broken. A single failed fetch attempt should not break the entire code path, no matter what the exception type is.

Regarding the place to catch that exception: for debugging or diagnosis, we'd better record how many remote fetch failures there were and why, and log whether all of the fetches failed. It seems more reasonable to catch the exception in ```getRemote```. ```getRemote``` itself can tolerate fetch failures from some of the remotes; it only requires a single success. That is the design of ```getRemote```.

```fetchBlockSync``` is a function that tells whether a fetch failed or succeeded. If it succeeds, it returns the data; if not, it throws an exception indicating why. Swallowing the exception there does not seem so reasonable. The upper-level function can decide whether the exception or fetch failure is acceptable.

What do you think?


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36179785
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("block manager crash test") {
+    conf.set("spark.network.timeout", "5s")
+    store = makeBlockManager(8000)
+    store2 = makeBlockManager(8000)
+    val list1 = List(new Array[Byte](4000))
+    store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+    val list1get = store.getRemote("list1")
+    assert(list1get.isDefined, "list1get expected to be fetched")
+    //simulate block manager crash
+    store2.stop()
+    val list1getagain = store.getRemoteBytes("list1")
--- End diff --

camel case


[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36182323
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("block manager crash test") {
+    conf.set("spark.network.timeout", "5s")
+    store = makeBlockManager(8000)
+    store2 = makeBlockManager(8000)
+    val list1 = List(new Array[Byte](4000))
+    store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+    val list1get = store.getRemote("list1")
--- End diff --

Thanks, I will fix it.





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36182313
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

Thanks @srowen and @CodingCat for the comments!
* If I understand correctly, `doGetRemote` returns `None` when every location 
fails to produce the block, and every method that calls `doGetRemote` handles 
the `None` case and throws an exception when necessary, so I think it's safe to 
catch the exception here.
* `fetchBlockSync` just calls `fetchBlocks` to fetch the block, so I think 
catching the exception here is equivalent to catching it there.
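
(For illustration, a hedged sketch of the caller-side handling being described; 
`getRemoteBytes` is a real method on `BlockManager`, but the broadcast-style 
consumer below and its error message are assumptions.)

```scala
// doGetRemote returning None is turned into a hard failure only by
// callers for whom a missing block actually is fatal.
def getRemoteBytes(blockId: BlockId): Option[ByteBuffer] = {
  doGetRemote(blockId, asBlockResult = false).asInstanceOf[Option[ByteBuffer]]
}

// e.g. a hypothetical consumer such as the broadcast fetch path:
val bytes = blockManager.getRemoteBytes(blockId).getOrElse {
  throw new SparkException(s"Block $blockId not found in any remote location")
}
```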





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36179774
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("block manager crash test") {
+    conf.set("spark.network.timeout", "5s")
+    store = makeBlockManager(8000)
+    store2 = makeBlockManager(8000)
+    val list1 = List(new Array[Byte](4000))
+    store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+    val list1get = store.getRemote("list1")
--- End diff --

camel case naming style... so it should be `list1Get`





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36180197
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

agree with @srowen, otherwise it might hide other evils





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-127541931
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36172665
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

I'm not sure it's valid to catch any throwable here and continue. This can 
be more targeted, and pushed down into `fetchBlockSync`?





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-127541700
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/7927

[SPARK-9591][CORE] Job may fail for exception during getting broadcast 
variable

[SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)
When getting a broadcast variable, we can fetch the block from several 
locations, but currently connecting to a lost block manager (e.g., one that sat 
idle long enough to be removed by the driver when dynamic resource allocation 
is in use) causes the task to fail, and in the worst case causes the job to 
fail.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark catch_exception

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7927.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7927


commit f8340a2ec880da41a36e3ae8bf87c5ac5badc919
Author: jeanlyn jeanly...@gmail.com
Date:   2015-08-04T09:23:29Z

catch exception avoid task fail







[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36180526
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
--- End diff --

In my experience, if you assign the result of a set of statements to a 
variable, you have to wrap those statements in `{}`.





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36180800
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
--- End diff --

The construct should be fine. It's one expression: try-(block)-catch-(block)
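
(A tiny self-contained illustration of that point; the values are made up.)

```scala
// In Scala, try/catch is itself an expression, so its result can be
// assigned to a val directly, with no extra wrapping braces needed.
val parsed: Int = try {
  "not a number".toInt
} catch {
  case _: NumberFormatException => -1 // fall back to a sentinel value
}
```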





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36202856
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

I think the point is that since you are expecting an `IOException` (or 
whatever it is) when one of the remotes goes down, you should only catch 
that exception. If we get some other weird random exception, we should 
probably still throw it, since it might be some bigger problem.

Also I don't think simply ignoring the exception is right.  If you only get 
an exception from one location, but then another location is fine, sure, just 
forget the exception.  But what if you get an exception from all locations?  
Then you should still throw an exception.  You could do something like what is 
done in 
[`askWithRetry`](https://github.com/apache/spark/blob/d702d53732b44e8242448ce5302738bd130717d8/core/src/main/scala/org/apache/spark/rpc/RpcEndpointRef.scala#L96).
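
(For illustration, a minimal sketch of that shape, loosely modeled on 
`askWithRetry`; the `BlockFetchException` wrapper anticipates what the merged 
patch ends up adding, and the surrounding details are assumptions.)

```scala
import java.io.IOException

// Tolerate per-location IOExceptions, but if every location fails,
// surface the failure instead of silently returning None.
var numFetchFailures = 0
for (loc <- locations) {
  val data = try {
    blockTransferService.fetchBlockSync(
      loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
  } catch {
    case e: IOException =>
      numFetchFailures += 1
      if (numFetchFailures == locations.size) {
        // This was the last location: rethrow rather than hide the problem.
        throw new BlockFetchException(
          s"Failed to fetch block after ${locations.size} fetch failures." +
          s" Most recent failure cause:", e)
      }
      logWarning(s"Failed to fetch remote block $blockId from $loc" +
        s" (failed attempt $numFetchFailures)", e)
      null
  }
  if (data != null) {
    return Some(data) // first success wins
  }
}
None
```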





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36204820
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("block manager crash test") {
+    conf.set("spark.network.timeout", "5s")
+    store = makeBlockManager(8000)
+    store2 = makeBlockManager(8000)
+    val list1 = List(new Array[Byte](4000))
+    store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+    val list1Get = store.getRemote("list1")
+    assert(list1Get.isDefined, "list1get expected to be fetched")
+    // simulate block manager crashed
+    store2.stop()
+    val list1GetAgain = store.getRemoteBytes("list1")
+    assert(!list1GetAgain.isDefined, "list1getagain expected to be not fetched")
--- End diff --

I don't think you are testing the right thing here at all ... I think in 
this case we still want to throw an exception (as I commented above); 
otherwise there is a big behavior change.

I thought the behavior you want is: if you store the block in *multiple* 
locations but then lose one of them, `getRemoteBytes` shouldn't fail -- it 
should keep trying until it finds a valid location.

To make your test take multiple peered block managers into account, you need 
to pass a second arg to `makeBlockManager` to give them each a unique id, as 
is done in the [master + 2 managers interaction 
test](https://github.com/apache/spark/blob/d702d53732b44e8242448ce5302738bd130717d8/core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala#L180)
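
(A hedged sketch of that setup; the second `makeBlockManager` parameter is the 
block manager's name, per the linked test, and the executor ids below are 
illustrative.)

```scala
// Give each peer a distinct id so the master tracks two genuine
// locations for the block, mirroring the linked interaction test.
store = makeBlockManager(8000, "exec1")
store2 = makeBlockManager(8000, "exec2")
```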





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread GraceH
Github user GraceH commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36271550
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

@squito So I agree to do it like ```askWithRetry```. If we can get the block 
from any remote store successfully, the call succeeds. We should not break 
the working path whenever we meet the first exception.

So maybe we need to catch all kinds of exceptions (not only IOException). If 
some attempts fail, we log the exception information but continue the 
fetching work. When we reach the final location and it still throws an 
exception, we throw a NEW exception to signal that all attempts failed (i.e., 
no location is available), and perhaps attach the last exception's 
information to this NEW exception.

But if we only handle IOException, then meeting some other type of exception 
for a certain location still breaks the entire workflow (fetching data from 
the remaining locations where possible).

What do you think?
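
(A minimal sketch of the proposed NEW exception; its shape matches the 
`BlockFetchException` case class the merged patch ends up adding, though 
extending `SparkException` is an assumption here.)

```scala
// Wraps a summary of the failed attempts together with the last cause,
// so callers can see why every location failed.
private[spark] case class BlockFetchException(messages: String, throwable: Throwable)
  extends SparkException(messages, throwable)
```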





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36217402
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

Makes sense, I will fix it later. Thanks @squito a lot!





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36235472
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("block manager crash test") {
+    conf.set("spark.network.timeout", "5s")
+    store = makeBlockManager(8000)
+    store2 = makeBlockManager(8000)
+    val list1 = List(new Array[Byte](4000))
+    store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+    val list1Get = store.getRemote("list1")
+    assert(list1Get.isDefined, "list1get expected to be fetched")
+    // simulate block manager crashed
+    store2.stop()
+    val list1GetAgain = store.getRemoteBytes("list1")
+    assert(!list1GetAgain.isDefined, "list1getagain expected to be not fetched")
+    conf.set("spark.network.timeout", "120s")
--- End diff --

This statement looks like it should be unnecessary, but I guess it's not, 
since this test is mutating a shared configuration. Is there any way to avoid 
this type of coupling between tests, or to ensure that the previous state is 
restored in a `finally` block or an `afterEach()` method?
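
(For illustration, a hedged sketch of the `finally`-based restore; 
`conf.getOption` and `conf.remove` exist on `SparkConf`, but the test body and 
naming here are assumptions.)

```scala
test("block manager crash test") {
  // Remember the previous value so the shared conf can be restored
  // even if the test body throws.
  val oldTimeout = conf.getOption("spark.network.timeout")
  conf.set("spark.network.timeout", "5s")
  try {
    // ... test body as above ...
  } finally {
    oldTimeout match {
      case Some(v) => conf.set("spark.network.timeout", v)
      case None => conf.remove("spark.network.timeout")
    }
  }
}
```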





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36235337
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
+          null
--- End diff --

Why is it safe to return `null` here if errors occur? After you explain 
this, can you add a comment so this explanation is recorded somewhere?





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36235201
  
--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("block manager crash test") {
--- End diff --

Can you reference a JIRA in the test title or otherwise provide more 
description of what case is supposed to be tested here? This isn't a super 
descriptive name as it stands. Also, the test name probably doesn't need to 
include the word "test".

