[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-13 Thread jaceklaskowski
Github user jaceklaskowski closed the pull request at:

https://github.com/apache/spark/pull/16250


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-12 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/16250#discussion_r92062052
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
       // (so we don't unnecessarily re-compute that data).
       val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
       val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      (0 until locs.length).foreach { i =>
-        if (locs(i) ne null) {
-          // locs(i) will be null if missing
-          stage.addOutputLoc(i, locs(i))
+      locs.zipWithIndex
--- End diff --

It's hard to profile these things closely enough to figure out how much
overhead it adds (it depends on locs.length, of course, etc.). My vote would be
to be conservative and keep the existing version.





[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16250#discussion_r92072568
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
       // (so we don't unnecessarily re-compute that data).
       val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
       val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      (0 until locs.length).foreach { i =>
-        if (locs(i) ne null) {
-          // locs(i) will be null if missing
-          stage.addOutputLoc(i, locs(i))
+      locs.zipWithIndex
--- End diff --

I'd be wary of making this type of change since there are cases where we've 
purposely removed Scala-style transformations in hot scheduler loops.





[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16250#discussion_r91862235
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
       // (so we don't unnecessarily re-compute that data).
       val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
       val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      (0 until locs.length).foreach { i =>
-        if (locs(i) ne null) {
-          // locs(i) will be null if missing
-          stage.addOutputLoc(i, locs(i))
+      locs.zipWithIndex
--- End diff --

Just FYI, this is going to be slower than the original code. I don't know if
it matters, but it will be slower.
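The performance point can be illustrated with a minimal stand-alone sketch (not the Spark code itself; the names and types here are simplified stand-ins). Both variants collect the indices and values of the non-null entries: `imperative` mirrors the original index loop, which does no extra allocation, while `functional` mirrors the proposed zipWithIndex/filter chain, which materializes intermediate arrays of tuples on the way.

```scala
object OutputLocSketch {
  // Original style: a plain index loop, no intermediate collections.
  def imperative(locs: Array[String]): Seq[(Int, String)] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[(Int, String)]
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) { // locs(i) will be null if missing
        out += ((i, locs(i)))
      }
    }
    out.toSeq
  }

  // Proposed style: zipWithIndex allocates an array of tuples, and
  // filter/map each allocate another intermediate array.
  def functional(locs: Array[String]): Seq[(Int, String)] =
    locs.zipWithIndex
      .filter { case (status, _) => status != null }
      .map { case (status, idx) => (idx, status) }
      .toSeq
}
```

The two produce the same result; the disagreement in the thread is only about whether the extra allocations matter on a scheduler path.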






[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-11 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/16250#discussion_r91860470
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
       // (so we don't unnecessarily re-compute that data).
       val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
       val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      (0 until locs.length).foreach { i =>
-        if (locs(i) ne null) {
-          // locs(i) will be null if missing
-          stage.addOutputLoc(i, locs(i))
+      locs.zipWithIndex
+        .filter { case (status, _) => status != null } // null if missing
+        .foreach { case (ms, idx) =>
+          stage.addOutputLoc(idx, ms)
         }
-      }
     } else {
       // Kind of ugly: need to register RDDs with the cache and map output tracker here
       // since we can't do it in the RDD constructor because # of partitions is unknown
-      logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
+      logInfo(s"Registering RDD ${rdd.id} (partitions: ${rdd.partitions.length}, " +
--- End diff --

Yes, but that's because I think it's important at this "stage". Do you
think I should remove the partition-related addition?





[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-11 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16250#discussion_r91857713
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1227,12 +1220,19 @@ class DAGScheduler(
           }
         }
         submitWaitingChildStages(shuffleStage)
+      } else {
+        // Some tasks had failed; let's resubmit this shuffleStage
+        // TODO: Lower-level scheduler should also deal with this
+        logInfo(s"Resubmitting $shuffleStage (${shuffleStage.name}) " +
+          "because some of its tasks had failed: " +
+          shuffleStage.findMissingPartitions().mkString(", "))
--- End diff --

If you bother with this, you could go all the way and pull this out as a val.
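A sketch of that suggestion, with hypothetical stand-ins for the Spark types: the joined partition list is computed once into a val instead of being built inline inside the multi-line log message.

```scala
object ResubmitMessageSketch {
  // Stand-in for shuffleStage.findMissingPartitions(); data is illustrative.
  def findMissingPartitions(): Seq[Int] = Seq(3, 7)

  def resubmitMessage(stageName: String): String = {
    // Pull the joined list out as a val, per the review suggestion.
    val missingPartitions = findMissingPartitions().mkString(", ")
    s"Resubmitting stage ($stageName) because some of its tasks had failed: $missingPartitions"
  }
}
```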





[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-11 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16250#discussion_r91857696
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
       // (so we don't unnecessarily re-compute that data).
       val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
       val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      (0 until locs.length).foreach { i =>
-        if (locs(i) ne null) {
-          // locs(i) will be null if missing
-          stage.addOutputLoc(i, locs(i))
+      locs.zipWithIndex
+        .filter { case (status, _) => status != null } // null if missing
+        .foreach { case (ms, idx) =>
--- End diff --

Use ms or status consistently?





[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-11 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16250#discussion_r91857699
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
       // (so we don't unnecessarily re-compute that data).
       val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
       val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      (0 until locs.length).foreach { i =>
-        if (locs(i) ne null) {
-          // locs(i) will be null if missing
-          stage.addOutputLoc(i, locs(i))
+      locs.zipWithIndex
+        .filter { case (status, _) => status != null } // null if missing
+        .foreach { case (ms, idx) =>
+          stage.addOutputLoc(idx, ms)
         }
-      }
     } else {
       // Kind of ugly: need to register RDDs with the cache and map output tracker here
       // since we can't do it in the RDD constructor because # of partitions is unknown
-      logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
+      logInfo(s"Registering RDD ${rdd.id} (partitions: ${rdd.partitions.length}, " +
--- End diff --

No big deal, but this changes the log message too.
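For illustration, a hypothetical side-by-side of the two message styles: the old concatenation form versus the new s-interpolated form that also adds the partition count. The exact tail of the new message is truncated in the quoted diff, so the format below is an assumption.

```scala
object RegisterLogSketch {
  // Old style: string concatenation, no partition count.
  def before(id: Int, creationSite: String): String =
    "Registering RDD " + id + " (" + creationSite + ")"

  // New style: s-interpolation, with the partition count added
  // (illustrative format only; the real tail is truncated in the diff).
  def after(id: Int, numPartitions: Int, creationSite: String): String =
    s"Registering RDD $id (partitions: $numPartitions, $creationSite)"
}
```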





[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...

2016-12-11 Thread jaceklaskowski
GitHub user jaceklaskowski opened a pull request:

https://github.com/apache/spark/pull/16250

[CORE][MINOR] Stylistic changes in DAGScheduler (to ease comprehension thereof)

## What changes were proposed in this pull request?

Stylistic changes in `DAGScheduler` to ease comprehension thereof.

## How was this patch tested?

Local build. Awaiting Jenkins to run the entire test suite.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jaceklaskowski/spark dagscheduler-minors

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16250


commit bc39671cf6868f033b69179cbdc6e6c373904149
Author: Jacek Laskowski 
Date:   2016-12-11T15:00:52Z

[CORE][MINOR] Stylistic changes in DAGScheduler (to ease comprehension 
thereof)



