[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-03-25 Thread ilganeli
Github user ilganeli closed the pull request at:

https://github.com/apache/spark/pull/4703


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-03-23 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-85319636
  
Hey @ilganeli would you mind closing this one since reviews are going on in 
the newer one?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-03-09 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-77930465
  
Hi @srowen and @markhamstra , with the exception of the refactoring for 
newResultStage that Mark proposed, I've added the recommended changes. Is there 
anything else that this needs? Would appreciate the feedback. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-03-04 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-77263660
  
Hi all  - I had to revert to my initial implementation since Mark's 
suggested refactoring introduced a test failure. Is this good to go?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-03-04 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-77267879
  
Interesting.  Looks like the failed test results are no longer available.  
Do you recall what the problem was?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-03-04 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-77272127
  
It was a pretty obscure error. I could revert and give you the stack trace 
but I played around with it a bit and wasn't able to trace it down. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75829212
  
Hi Sean - I'll give the retest a shot. If there's a way to get 
whitelisted that would be amazing. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75838380
  
  [Test build #27900 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27900/consoleFull)
 for   PR 4708 at commit 
[`c9288e2`](https://github.com/apache/spark/commit/c9288e2d91ded0efcb6280b18fffae84a9abfba6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75828243
  
  [Test build #27900 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27900/consoleFull)
 for   PR 4708 at commit 
[`c9288e2`](https://github.com/apache/spark/commit/c9288e2d91ded0efcb6280b18fffae84a9abfba6).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75838391
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27900/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75824748
  
Retest please :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75827924
  
Looks like a transient git failure yes. I wonder if you have the power to 
tell Jenkins to retest? retest this please seems to work for me. If not maybe 
we can get you whitelisted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75855424
  
  [Test build #27911 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27911/consoleFull)
 for   PR 4708 at commit 
[`c9288e2`](https://github.com/apache/spark/commit/c9288e2d91ded0efcb6280b18fffae84a9abfba6).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75865162
  
  [Test build #27912 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27912/consoleFull)
 for   PR 4708 at commit 
[`1b3471b`](https://github.com/apache/spark/commit/1b3471b110d5c8b499c49185560f0f8788850996).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75856293
  
  [Test build #27912 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27912/consoleFull)
 for   PR 4708 at commit 
[`1b3471b`](https://github.com/apache/spark/commit/1b3471b110d5c8b499c49185560f0f8788850996).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75865176
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27912/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75881056
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27917/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75881048
  
  [Test build #27917 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27917/consoleFull)
 for   PR 4708 at commit 
[`6b43d7b`](https://github.com/apache/spark/commit/6b43d7b4e94025fdba6263d2627a18cb47414e7d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75864145
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27911/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75864128
  
  [Test build #27911 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27911/consoleFull)
 for   PR 4708 at commit 
[`c9288e2`](https://github.com/apache/spark/commit/c9288e2d91ded0efcb6280b18fffae84a9abfba6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75855242
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75868482
  
  [Test build #27917 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27917/consoleFull)
 for   PR 4708 at commit 
[`6b43d7b`](https://github.com/apache/spark/commit/6b43d7b4e94025fdba6263d2627a18cb47414e7d).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75593300
  
  [Test build #27854 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27854/consoleFull)
 for   PR 4708 at commit 
[`b85c5fe`](https://github.com/apache/spark/commit/b85c5fe14fdece4769fc98bbedcba80252b325bf).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread ilganeli
Github user ilganeli commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25183148
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -830,39 +836,39 @@ class DAGScheduler(
 try {
   // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
   // For ResultTask, serialize and broadcast (rdd, func).
-  val taskBinaryBytes: Array[Byte] =
-if (stage.isShuffleMap) {
-  closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : 
AnyRef).array()
-} else {
-  closureSerializer.serialize((stage.rdd, 
stage.resultOfJob.get.func) : AnyRef).array()
-}
+  val taskBinaryBytes: Array[Byte] = stage match {
+case a: ShuffleMapStage =
+  closureSerializer.serialize((a.rdd, a.shuffleDep): 
AnyRef).array()
+case b: ResultStage =
+  closureSerializer.serialize((b.rdd, b.resultOfJob.get.func): 
AnyRef).array()
+  }
+
   taskBinary = sc.broadcast(taskBinaryBytes)
 } catch {
   // In the case of a failure during serialization, abort the stage.
   case e: NotSerializableException =
 abortStage(stage, Task not serializable:  + e.toString)
 runningStages -= stage
-return
--- End diff --

This was a mistake introduced when I was doing the second round of 
refactoring (copying code back from when I pulled this all out to its own 
method). When this code is within its own method then we can just look at the 
return value of the method and the weird return breaks become unnecessary. I'll 
add a comment for these in the meantime. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75637201
  
Look pretty good to me, but left a few more comments.  Also, please take a 
look at the various logging strings to see whether some of them can be 
expressed more readably using string interpolation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25202931
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -210,40 +210,58 @@ class DAGScheduler(
* The jobId value passed in will be used if the stage doesn't already 
exist with
* a lower jobId (jobId always increases across jobs.)
*/
-  private def getShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], 
jobId: Int): Stage = {
+  private def getShuffleMapStage(
+  shuffleDep: ShuffleDependency[_, _, _],
+  jobId: Int): ShuffleMapStage = {
 shuffleToMapStage.get(shuffleDep.shuffleId) match {
   case Some(stage) = stage
   case None =
 // We are going to register ancestor shuffle dependencies
 registerShuffleDependencies(shuffleDep, jobId)
 // Then register current shuffleDep
-val stage =
-  newOrUsedStage(
-shuffleDep.rdd, shuffleDep.rdd.partitions.size, shuffleDep, 
jobId,
-shuffleDep.rdd.creationSite)
+val stage = newOrUsedShuffleStage(shuffleDep, jobId)
 shuffleToMapStage(shuffleDep.shuffleId) = stage
- 
+
 stage
 }
   }
 
   /**
-   * Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage.  The stage will be 
associated with the provided
-   * jobId. Production of shuffle map stages should always use 
newOrUsedStage, not newStage
-   * directly.
+   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle 
map stage in
+   * newOrUsedShuffleStage.  The stage will be associated with the provide 
jobId.
+   * Production of shuffle map stages should always use 
newOrUsedShuffleStage,not
+   * newShuffleMapStage directly.
*/
-  private def newStage(
+  private def newShuffleMapStage(
   rdd: RDD[_],
   numTasks: Int,
-  shuffleDep: Option[ShuffleDependency[_, _, _]],
+  shuffleDep: ShuffleDependency[_, _, _],
   jobId: Int,
-  callSite: CallSite)
-: Stage =
-  {
+  callSite: CallSite): ShuffleMapStage = {
 val parentStages = getParentStages(rdd, jobId)
 val id = nextStageId.getAndIncrement()
-val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, 
jobId, callSite)
+val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, 
parentStages,
+  jobId, callSite, shuffleDep)
+
+stageIdToStage(id) = stage
+updateJobIdStageIdMaps(jobId, stage)
+stage
+  }
+
+  /**
+   * Create a ResultStage -- either directly for use as a result stage, or 
as part of the
+   * (re)-creation of a shuffle map stage in newOrUsedShuffleStage.  The 
stage will be associated
+   * with the provided jobId.
+   */
+  private def newResultStage(
+  rdd: RDD[_],
+  numTasks: Int,
+  jobId: Int,
+  callSite: CallSite): ResultStage = {
+val parentStages = getParentStages(rdd, jobId)
+val id = nextStageId.getAndIncrement()
+val stage: ResultStage = new ResultStage(id, rdd, numTasks, 
parentStages, jobId, callSite)
+
--- End diff --

I'd rather avoid the code duplication in newShuffleMapStage and 
newResultStage.  This can be done in generic fashion via runtime reflection:
```scala
import scala.reflect.runtime.{universe = ru}
...
  private def newStage[T : Stage: ru.TypeTag](
  rdd: RDD[_],
  numTasks: Int,
  shuffleDep: Option[ShuffleDependency[_, _, _]],
  jobId: Int,
  callSite: CallSite): T = {
val m = ru.runtimeMirror(getClass.getClassLoader)
val classT = ru.typeOf[T].typeSymbol.asClass
val cm = m.reflectClass(classT)
val ctor = ru.typeOf[T].declaration(ru.nme.CONSTRUCTOR).asMethod
val ctorm = cm.reflectConstructor(ctor)
val parentStages = getParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
val stage = shuffleDep.map { shufDep =
  ctorm(id, rdd, numTasks, parentStages, jobId, callSite, shufDep)
}.getOrElse(ctorm(id, rdd, numTasks, parentStages, jobId, 
callSite)).asInstanceOf[T]

stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
  }
...
  val stage = newStage[ShuffleMapStage](rdd, numTasks, Some(shuffleDep), 
jobId, rdd.creationSite)
...
  finalStage = newStage[ResultStage](finalRDD, partitions.size, None, 
jobId, callSite)
```
...but I'd want to see the performance numbers on that before deciding not 
to go with a less flexible approach that avoids reflection:
```scala
  private def newStage[T : Stage](
   

[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25186634
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -77,52 +71,9 @@ private[spark] class Stage(
   /** Pointer to the latest [StageInfo] object, set by DAGScheduler. */
   var latestInfo: StageInfo = StageInfo.fromStage(this)
 
-  def isAvailable: Boolean = {
-if (!isShuffleMap) {
-  true
-} else {
-  numAvailableOutputs == numPartitions
-}
-  }
-
-  def addOutputLoc(partition: Int, status: MapStatus) {
-val prevList = outputLocs(partition)
-outputLocs(partition) = status :: prevList
-if (prevList == Nil) {
-  numAvailableOutputs += 1
-}
-  }
-
-  def removeOutputLoc(partition: Int, bmAddress: BlockManagerId) {
-val prevList = outputLocs(partition)
-val newList = prevList.filterNot(_.location == bmAddress)
-outputLocs(partition) = newList
-if (prevList != Nil  newList == Nil) {
-  numAvailableOutputs -= 1
-}
-  }
+  var numAvailableOutputs = 0
--- End diff --

...for nextAttemptId, too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75604216
  
  [Test build #27858 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27858/consoleFull)
 for   PR 4708 at commit 
[`d548caf`](https://github.com/apache/spark/commit/d548cafab4b6f36ee7e9bed696419567f0bc3d94).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75609741
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27854/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75610748
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27856/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25186416
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -47,26 +47,20 @@ import org.apache.spark.util.CallSite
  * be updated for each attempt.
--- End diff --

Remove unused BlockManagerId from imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75594276
  
  [Test build #27856 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27856/consoleFull)
 for   PR 4708 at commit 
[`6da3a71`](https://github.com/apache/spark/commit/6da3a7101c3c8087a9a924b998889eb6e1b3446f).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25186558
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -77,52 +71,9 @@ private[spark] class Stage(
   /** Pointer to the latest [StageInfo] object, set by DAGScheduler. */
   var latestInfo: StageInfo = StageInfo.fromStage(this)
 
-  def isAvailable: Boolean = {
-if (!isShuffleMap) {
-  true
-} else {
-  numAvailableOutputs == numPartitions
-}
-  }
-
-  def addOutputLoc(partition: Int, status: MapStatus) {
-val prevList = outputLocs(partition)
-outputLocs(partition) = status :: prevList
-if (prevList == Nil) {
-  numAvailableOutputs += 1
-}
-  }
-
-  def removeOutputLoc(partition: Int, bmAddress: BlockManagerId) {
-val prevList = outputLocs(partition)
-val newList = prevList.filterNot(_.location == bmAddress)
-outputLocs(partition) = newList
-if (prevList != Nil  newList == Nil) {
-  numAvailableOutputs -= 1
-}
-  }
+  var numAvailableOutputs = 0
--- End diff --

Add explicit type declaration


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25188257
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -228,22 +227,41 @@ class DAGScheduler(
   }
 
   /**
-   * Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage.  The stage will be 
associated with the provided
-   * jobId. Production of shuffle map stages should always use 
newOrUsedStage, not newStage
-   * directly.
+   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle 
map stage in 
+   * newOrUsedShuffleStage.  The stage will be associated with the provided
+   * jobId. Production of shuffle map stages should always use 
newOrUsedShuffleStage, 
+   * not newShuffleMapStage directly.
--- End diff --

nit: reformat a little...
```scala
  /**
   * Create a ShuffleMapStage as part of the (re)-creation of a shuffle map 
stage in 
   * newOrUsedShuffleStage.  The stage will be associated with the provided 
jobId.
   * Production of shuffle map stages should always use 
newOrUsedShuffleStage, not
   * newShuffleMapStage directly.
   */
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75609726
  
  [Test build #27854 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27854/consoleFull)
 for   PR 4708 at commit 
[`b85c5fe`](https://github.com/apache/spark/commit/b85c5fe14fdece4769fc98bbedcba80252b325bf).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75610731
  
  [Test build #27856 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27856/consoleFull)
 for   PR 4708 at commit 
[`6da3a71`](https://github.com/apache/spark/commit/6da3a7101c3c8087a9a924b998889eb6e1b3446f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75602215
  
@JoshRosen I'm happy to take a look at this but won't be able to get to it 
until Friday.  Feel free to merge it sooner than that if you're eager to get it 
in; otherwise I'll take a look Friday!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75620128
  
  [Test build #27858 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27858/consoleFull)
 for   PR 4708 at commit 
[`d548caf`](https://github.com/apache/spark/commit/d548cafab4b6f36ee7e9bed696419567f0bc3d94).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75620137
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27858/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75653958
  
nits: extra spaces in type declarations.  E.g,:
```scala
var numAvailableOutputs: Long = 0
```
not
```scala
var numAvailableOutputs : Long = 0
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75681861
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27871/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-23 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75691936
  
What happened with this test? Seems like it failed to fetch from git?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75492348
  
Hmm, it looks like MiMa is complaining about the `Stage` class becoming 
abstract even though it's `private[spark]`:

```
[info] spark-core: found 1 potential binary incompatibilities (filtered 352)
[error]  * class org.apache.spark.scheduler.Stage was concrete; is declared 
abstract in new version
[error]filter with: 
ProblemFilters.exclude[AbstractClassProblem](org.apache.spark.scheduler.Stage)
```

We can fix this by adding an exclude.  When you do this, make sure to add a 
comment referencing this JIRA so we know why it was introduced.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75493744
  
I've added an exclude for the Spark 1.3 build - hopefully that resolves 
this issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25145518
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -830,39 +836,39 @@ class DAGScheduler(
 try {
   // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
   // For ResultTask, serialize and broadcast (rdd, func).
-  val taskBinaryBytes: Array[Byte] =
-if (stage.isShuffleMap) {
-  closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : 
AnyRef).array()
-} else {
-  closureSerializer.serialize((stage.rdd, 
stage.resultOfJob.get.func) : AnyRef).array()
-}
+  val taskBinaryBytes: Array[Byte] = stage match {
+case a: ShuffleMapStage =
+  closureSerializer.serialize((a.rdd, a.shuffleDep): 
AnyRef).array()
+case b: ResultStage =
+  closureSerializer.serialize((b.rdd, b.resultOfJob.get.func): 
AnyRef).array()
+  }
+
   taskBinary = sc.broadcast(taskBinaryBytes)
 } catch {
   // In the case of a failure during serialization, abort the stage.
   case e: NotSerializableException =
 abortStage(stage, Task not serializable:  + e.toString)
 runningStages -= stage
-return
   case NonFatal(e) =
 abortStage(stage, sTask serialization failed: 
$e\n${e.getStackTraceString})
 runningStages -= stage
-return
 }
 
-val tasks: Seq[Task[_]] = if (stage.isShuffleMap) {
-  partitionsToCompute.map { id =
-val locs = getPreferredLocs(stage.rdd, id)
-val part = stage.rdd.partitions(id)
-new ShuffleMapTask(stage.id, taskBinary, part, locs)
-  }
-} else {
-  val job = stage.resultOfJob.get
-  partitionsToCompute.map { id =
-val p: Int = job.partitions(id)
-val part = stage.rdd.partitions(p)
-val locs = getPreferredLocs(stage.rdd, p)
-new ResultTask(stage.id, taskBinary, part, locs, id)
-  }
+val tasks: Seq[Task[_]] = stage match {
+  case a: ShuffleMapStage =
--- End diff --

It's legal to have a case statement where multiple branches use the same 
variable name (because there's no fall-through):

```scala
scala def foo(x: Any) = x match {
 | case y: String = y
 | case y: Int = y * 2
 | }
foo: (x: Any)Any

scala foo(1)
res1: Any = 2

scala foo(hello)
res2: Any = hello
```

Therefore, I think it would be clearer to replace `a` and `b` with `stage` 
(this will also reduce the size of the diff).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25145581
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -996,50 +1006,52 @@ class DAGScheduler(
 if (failedEpoch.contains(execId)  smt.epoch = 
failedEpoch(execId)) {
   logInfo(Ignoring possibly bogus ShuffleMapTask completion 
from  + execId)
 } else {
-  stage.addOutputLoc(smt.partitionId, status)
+  shuffleStage.addOutputLoc(smt.partitionId, status)
 }
-if (runningStages.contains(stage)  
stage.pendingTasks.isEmpty) {
-  markStageAsFinished(stage)
-  logInfo(looking for newly runnable stages)
+if (runningStages.contains(shuffleStage)  
shuffleStage.pendingTasks.isEmpty) {
+  markStageAsFinished(shuffleStage)
+  logInfo(looking for newly runnable shuffleStages)
--- End diff --

I think that this log message should remain unchanged, since technically I 
think we'll consider scheduling of all runnable stages, which might also 
include result stages


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25145702
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultStage.scala 
---
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.CallSite
+
+/**
+ * Define a class that represents the ResultStage to help clean up the 
DAGScheduler class 
--- End diff --

Same here; I don't think this comment is necessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25145735
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultStage.scala 
---
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.CallSite
+
+/**
+ * Define a class that represents the ResultStage to help clean up the 
DAGScheduler class 
+ */
+private[spark] class ResultStage(
+override val id: Int,
+override val rdd: RDD[_],
+override val numTasks: Int,
+override val parents: List[Stage],
+override val jobId: Int,
+override val callSite: CallSite) 
+  extends Stage(id, rdd, numTasks, parents, jobId, callSite) {
+
+  /** For stages that are the final (consists of only ResultTasks), link 
to the ActiveJob. */
--- End diff --

Now that this field is only present in the new ResultStage class, maybe we 
could update this comment to say something like Links to the active job for 
this result stage.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75493784
  
  [Test build #27850 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27850/consoleFull)
 for   PR 4708 at commit 
[`a57dfcd`](https://github.com/apache/spark/commit/a57dfcd919342e9b06cddc0d7ac7e576cdcdafcd).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25145685
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.ShuffleDependency
+import org.apache.spark.storage.BlockManagerId
+import org.apache.spark.util.CallSite
+
+/**
+ * Define a class that represents the ShuffleMapStage to help clean up the 
DAGScheduler class 
--- End diff --

I don't think this description will make much sense after this patch is 
merged, since by that point the cleanup will have already been performed!  I 
think we can just omit this comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4708#discussion_r25145446
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -830,39 +836,39 @@ class DAGScheduler(
 try {
   // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
   // For ResultTask, serialize and broadcast (rdd, func).
-  val taskBinaryBytes: Array[Byte] =
-if (stage.isShuffleMap) {
-  closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : 
AnyRef).array()
-} else {
-  closureSerializer.serialize((stage.rdd, 
stage.resultOfJob.get.func) : AnyRef).array()
-}
+  val taskBinaryBytes: Array[Byte] = stage match {
+case a: ShuffleMapStage =
+  closureSerializer.serialize((a.rdd, a.shuffleDep): 
AnyRef).array()
+case b: ResultStage =
+  closureSerializer.serialize((b.rdd, b.resultOfJob.get.func): 
AnyRef).array()
+  }
+
   taskBinary = sc.broadcast(taskBinaryBytes)
 } catch {
   // In the case of a failure during serialization, abort the stage.
   case e: NotSerializableException =
 abortStage(stage, Task not serializable:  + e.toString)
 runningStages -= stage
-return
--- End diff --

I don't think that we can remove this `return` (or the one line 849).  If 
we remove this, then we'll hit errors later on because `taskBinary` will still 
be null due to the serialization error.

I think we should add these return statements back and possibly add a 
comment to each occurrence explaining why it's needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75494807
  
I took a quick pass, but overall this looks pretty good to me since it's 
inline with the refactoring that I had originally planned to do myself.  It's 
nice to see a lot of the weird `isShuffleMap` stuff removed.  I also like how 
this refactoring clarifies that ShuffleMapTasks always belong to 
ShuffleMapStages, etc.

Maybe @markhamstra or @kayousterhout would like to take a look at this?  
This looks like a pretty straightforward, low-risk refactoring to me, but it 
would be good to get more eyes on it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75499003
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27850/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75499000
  
  [Test build #27850 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27850/consoleFull)
 for   PR 4708 at commit 
[`a57dfcd`](https://github.com/apache/spark/commit/a57dfcd919342e9b06cddc0d7ac7e576cdcdafcd).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25058069
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -484,7 +505,7 @@ class DAGScheduler(
   Total number of partitions:  + maxPartitions)
 }
 
-val jobId = nextJobId.getAndIncrement()
+val jobId = nextJobId.getAndIncrement
--- End diff --

mutator should include the `()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25055713
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -47,26 +47,19 @@ import org.apache.spark.util.CallSite
  * be updated for each attempt.
  *
  */
-private[spark] class Stage(
-val id: Int,
-val rdd: RDD[_],
-val numTasks: Int,
-val shuffleDep: Option[ShuffleDependency[_, _, _]],  // Output shuffle 
if stage is a map stage
-val parents: List[Stage],
-val jobId: Int,
-val callSite: CallSite)
+private[spark] abstract class Stage(val id: Int,
+val rdd: RDD[_],
+val numTasks: Int,
+val parents: List[Stage],
+val jobId: Int,
+val callSite: CallSite)
--- End diff --

Follow the prior indentation style, please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25056356
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -132,9 +88,7 @@ private[spark] class Stage(
   }
 
   def attemptId: Int = nextAttemptId
-
-  override def toString = Stage  + id
-
+  
   override def hashCode(): Int = id
 
   override def equals(other: Any): Boolean = other match {
--- End diff --

`hashCode` and `equals` should probably be `final`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25057714
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -229,41 +227,56 @@ class DAGScheduler(
 
   /**
* Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage.  The stage will be 
associated with the provided
-   * jobId. Production of shuffle map stages should always use 
newOrUsedStage, not newStage
-   * directly.
+   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be 
associated with the provided
+   * jobId. Production of shuffle map stages should always use 
newOrUsedShuffleStage, 
+   * not getAndRegNewShuffleMap directly.
*/
-  private def newStage(
-  rdd: RDD[_],
-  numTasks: Int,
-  shuffleDep: Option[ShuffleDependency[_, _, _]],
-  jobId: Int,
-  callSite: CallSite)
-: Stage =
-  {
+  private def getAndRegNewShuffleMap(rdd: RDD[_],
+ numTasks: Int,
+ shuffleDep: ShuffleDependency[_, _, 
_],
+ jobId: Int,
+ callSite: CallSite): ShuffleMapStage 
= {
 val parentStages = getParentStages(rdd, jobId)
-val id = nextStageId.getAndIncrement()
-val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, 
jobId, callSite)
+val id = nextStageId.getAndIncrement
+val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, 
parentStages, 
+  jobId, callSite, shuffleDep)
+  
 stageIdToStage(id) = stage
 updateJobIdStageIdMaps(jobId, stage)
 stage
   }
 
   /**
+   * Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
+   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be 
associated with the provided
+   * jobId. Production of shuffle map stages should always use 
newOrUsedShuffleStage, 
+   * not getAndRegNewResults directly.
+   */
+  private def getAndRegNewResults(rdd: RDD[_],
--- End diff --

I'm not a fan of the `getAndReg` addition since `get` falsely implies some 
kind of look up, and `reg` is more-or-less an implementation detail that across 
future refactorings won't necessarily be a fixed part of the creation of new 
stages.  The new, longer naming just doesn't add anything positive, imo.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25057992
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -306,26 +319,31 @@ class DAGScheduler(
   }
 }
 waitingForVisit.push(rdd)
-while (!waitingForVisit.isEmpty) {
+while (waitingForVisit.nonEmpty) {
   visit(waitingForVisit.pop())
 }
 parents.toList
   }
-
-  // Find ancestor missing shuffle dependencies and register into 
shuffleToMapStage
+  
+  /**
+   * Find ancestor missing shuffle dependencies and register into 
shuffleToMapStage 
+   * @param shuffleDep
+   * @param jobId
--- End diff --

Placeholder doc annotations aren't helpful for `private` methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25056104
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
@@ -77,53 +70,16 @@ private[spark] class Stage(
   /** Pointer to the latest [StageInfo] object, set by DAGScheduler. */
   var latestInfo: StageInfo = StageInfo.fromStage(this)
 
+  var numAvailableOutputs = 0
+
   def isAvailable: Boolean = {
-if (!isShuffleMap) {
+if (!this.isInstanceOf[ShuffleMapStage]) {
   true
 } else {
   numAvailableOutputs == numPartitions
 }
   }
--- End diff --

Why not leave this abstract with just one branch of the if in each override?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25056762
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -229,41 +227,56 @@ class DAGScheduler(
 
   /**
* Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
-   * of a shuffle map stage in newOrUsedStage.  The stage will be 
associated with the provided
-   * jobId. Production of shuffle map stages should always use 
newOrUsedStage, not newStage
-   * directly.
+   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be 
associated with the provided
+   * jobId. Production of shuffle map stages should always use 
newOrUsedShuffleStage, 
+   * not getAndRegNewShuffleMap directly.
*/
-  private def newStage(
-  rdd: RDD[_],
-  numTasks: Int,
-  shuffleDep: Option[ShuffleDependency[_, _, _]],
-  jobId: Int,
-  callSite: CallSite)
-: Stage =
-  {
+  private def getAndRegNewShuffleMap(rdd: RDD[_],
+ numTasks: Int,
+ shuffleDep: ShuffleDependency[_, _, 
_],
+ jobId: Int,
+ callSite: CallSite): ShuffleMapStage 
= {
 val parentStages = getParentStages(rdd, jobId)
-val id = nextStageId.getAndIncrement()
-val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, 
jobId, callSite)
+val id = nextStageId.getAndIncrement
+val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, 
parentStages, 
+  jobId, callSite, shuffleDep)
+  
 stageIdToStage(id) = stage
 updateJobIdStageIdMaps(jobId, stage)
 stage
   }
 
   /**
+   * Create a Stage -- either directly for use as a result stage, or as 
part of the (re)-creation
+   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be 
associated with the provided
+   * jobId. Production of shuffle map stages should always use 
newOrUsedShuffleStage, 
+   * not getAndRegNewResults directly.
+   */
+  private def getAndRegNewResults(rdd: RDD[_],
+ numTasks: Int,
+ jobId: Int,
+ callSite: CallSite): ResultStage = {
--- End diff --

Indentation of this one's even worse than the others.  Please fix all your 
`def`s to adhere to the Style Guide.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4703#discussion_r25058302
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -912,6 +959,196 @@ class DAGScheduler(
   }
 
   /**
+   * Helper function to handle stage completion
+   * @param stage
+   * @param errorMessage
+   * @return
+   */
+  private def markStageAsFinished(stage: Stage, errorMessage: 
Option[String] = None) = {
+val serviceTime = stage.latestInfo.submissionTime match {
+  case Some(t) = %.03f.format((clock.getTime() - t) / 1000.0)
+  case _ = Unknown
+}
+if (errorMessage.isEmpty) {
+  logInfo(%s (%s) finished in %s s.format(stage, stage.name, 
serviceTime))
+  stage.latestInfo.completionTime = Some(clock.getTime())
+} else {
+  stage.latestInfo.stageFailed(errorMessage.get)
+  logInfo(%s (%s) failed in %s s.format(stage, stage.name, 
serviceTime))
+}
+listenerBus.post(SparkListenerStageCompleted(stage.latestInfo))
+runningStages -= stage
+  }
+
+  /**
+   * Helper functino to process a ResultTask completion
--- End diff --

typo


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75332520
  
  [Test build #27793 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27793/consoleFull)
 for   PR 4708 at commit 
[`96dd161`](https://github.com/apache/spark/commit/96dd161b604dd999448172047519e201e25ed28b).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-7531
  
  [Test build #27788 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27788/consoleFull)
 for   PR 4703 at commit 
[`8ed51d8`](https://github.com/apache/spark/commit/8ed51d81513b36289bbd76164b6a705ef1889ba3).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75320629
  
Josh, what I'll do is create a new branch and port the changes individually 
and leave this branch intact as a reference. I'll submit a separate PR for that 
work. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75318899
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27788/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread ilganeli
GitHub user ilganeli opened a pull request:

https://github.com/apache/spark/pull/4708

[SPARK-4655] Split Stage into ShuffleMapStage and ResultStage subclasses

Hi all - this patch includes splitting up Stage into ShuffleMapStage and 
ResultStage and updating their usage within DAGScheduler.

I wanted to confirm that it's appropriate to move the outputLocs variable 
and its associated methods into ShuffleMapStage. Within the shuffleMapTask 
handler (line 1004) I wanted to confirm that we are guaranteed for that stage 
to be a ShuffleMapStage

I believe I've split up the functionality of ResultStage and 
ShuffleMapStage appropriately but I wanted to make sure I'm not missing 
something.

Lastly, I suggest opening a JIRA to further improve readability of two 
functions within the DAGScheduler:

1) The monolithic handleTaskCompletion 
2) The cleanupStateForJobAndIndependentStages function

This last item is already code-complete in a separate PR but it should 
really be done within its own issue. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ilganeli/spark SPARK-4655

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4708.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4708


commit 83494e94ff45e0f275966d405d777d22cb167800
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-20T21:22:18Z

Added new Stage classes

commit cfd6f10d4f30d1940c7dbc61b58af22cfd647980
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-20T21:54:29Z

Updated DAGScheduler to use new ResultStage and ShuffleMapStage classes




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75327500
  
  [Test build #27792 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27792/consoleFull)
 for   PR 4708 at commit 
[`cfd6f10`](https://github.com/apache/spark/commit/cfd6f10d4f30d1940c7dbc61b58af22cfd647980).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75327654
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27792/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75341191
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27793/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75353758
  
  [Test build #27805 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27805/consoleFull)
 for   PR 4708 at commit 
[`83ed849`](https://github.com/apache/spark/commit/83ed849dbc063d8a6ec3b763932792b092a5a2d3).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75343825
  
Is this another of the spurious test failure? Seems to be some bizarre 
thing within ML Lib.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75355985
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27805/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75355984
  
  [Test build #27805 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27805/consoleFull)
 for   PR 4708 at commit 
[`83ed849`](https://github.com/apache/spark/commit/83ed849dbc063d8a6ec3b763932792b092a5a2d3).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75341185
  
  [Test build #27793 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27793/consoleFull)
 for   PR 4708 at commit 
[`96dd161`](https://github.com/apache/spark/commit/96dd161b604dd999448172047519e201e25ed28b).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75284304
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27780/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75284266
  
  [Test build #27780 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27780/consoleFull)
 for   PR 4703 at commit 
[`09a3437`](https://github.com/apache/spark/commit/09a343746602ec8cb6630de9d3308513ff73eea0).
 * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75284298
  
  [Test build #27780 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27780/consoleFull)
 for   PR 4703 at commit 
[`09a3437`](https://github.com/apache/spark/commit/09a343746602ec8cb6630de9d3308513ff73eea0).
 * This patch **fails RAT tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75327639
  
I've submitted a new PR here:
https://github.com/apache/spark/pull/4708


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4708#issuecomment-75327652
  
  [Test build #27792 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27792/consoleFull)
 for   PR 4708 at commit 
[`cfd6f10`](https://github.com/apache/spark/commit/cfd6f10d4f30d1940c7dbc61b58af22cfd647980).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75286796
  
  [Test build #27782 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27782/consoleFull)
 for   PR 4703 at commit 
[`6266165`](https://github.com/apache/spark/commit/6266165ee44587621cde059e21648b0c4a9f2dfd).
 * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75286815
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27782/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75288560
  
  [Test build #27783 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27783/consoleFull)
 for   PR 4703 at commit 
[`0a4109e`](https://github.com/apache/spark/commit/0a4109ecb1982130e7f561e8f72d47c6fd0cd9e0).
 * This patch **fails RAT tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75288542
  
  [Test build #27783 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27783/consoleFull)
 for   PR 4703 at commit 
[`0a4109e`](https://github.com/apache/spark/commit/0a4109ecb1982130e7f561e8f72d47c6fd0cd9e0).
 * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75286811
  
  [Test build #27782 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27782/consoleFull)
 for   PR 4703 at commit 
[`6266165`](https://github.com/apache/spark/commit/6266165ee44587621cde059e21648b0c4a9f2dfd).
 * This patch **fails RAT tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75288566
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27783/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75304156
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27787/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75304153
  
  [Test build #27787 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27787/consoleFull)
 for   PR 4703 at commit 
[`fe4e137`](https://github.com/apache/spark/commit/fe4e1378f3a56b2d633a5beabe324eede24c43dd).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75304143
  
  [Test build #27787 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27787/consoleFull)
 for   PR 4703 at commit 
[`fe4e137`](https://github.com/apache/spark/commit/fe4e1378f3a56b2d633a5beabe324eede24c43dd).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75306668
  
  [Test build #27788 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27788/consoleFull)
 for   PR 4703 at commit 
[`8ed51d8`](https://github.com/apache/spark/commit/8ed51d81513b36289bbd76164b6a705ef1889ba3).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75307867
  
What do you think about performing only the Stage split in this PR and 
deferring the other code movement to a separate one?  That would make this much 
easier to review, since I'm pretty sure that the Stage split itself should 
involve comparatively few changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread ilganeli
Github user ilganeli commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75308162
  
Hi @JoshRosen, I'm down for making your life easier. Would you still be 
interested in the other refactoring that I did as part of this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-20 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75311331
  
Hi @ilganeli,

Those other refactoring might be welcome, since the some of the 
mega-functions in DAGScheduler can be pretty hard to read.  Let's do those 
separately, though.  My conservatism here stems from the fact that DAGScheduler 
is a giant piece of complex code that only a handful of people understand well. 
 Given that there might be objections to performing _any_ unnecessary 
refactoring due to the potential to introduce bugs, I think it's best to work 
very gradually and perform many small, separate sets of changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-19 Thread ilganeli
GitHub user ilganeli opened a pull request:

https://github.com/apache/spark/pull/4703

[SPARK-4655] Split Stage into ShuffleMapStage and ResultStage subclasses

Hi all - this patch includes two main efforts:

1) I've split up Stage into ShuffleMapStage and ResultStage and updated 
their usage within DAGScheduler
2) While doing this, I ended up cleaning up a lot of the long and awkward 
functions within the DAGScheduler since I needed to do that to make sense of 
the code. I believe this improved readability of the code significantly - the 
only problem is that it's not straightforward to diff those changes side by 
side (particularly the updates to handleFetchFailure(). 

I wanted to confirm that it's appropriate to move the outputLocs variable 
and its associated methods into ShuffleMapStage. The one function of concern is 
within the completeStage() and handleFailedTasks() functions where I wanted to 
confirm that we are guaranteed for that stage to be a ShuffleMapStage

I believe I've split up the functionality of ResultStage and 
ShuffleMapStage appropriately but I wanted to make sure I'm not missing 
something.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ilganeli/spark SPARK-4653

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4703.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4703


commit 424fdef9578b4450ec75c8d84db98927e4b15831
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-18T01:19:20Z

Began refactoring Stage.scala

commit 562b688d07a9404196c6856b5f3302c31b05bb2d
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-19T00:54:36Z

Significant refactoring within DAG Scheduler class

commit 75fe74f616adfa4e2da7a439fc86f784265aae84
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-19T00:58:34Z

Merge remote-tracking branch 'upstream/master' into SPARK-4653

commit 5e76fb8b21ba877afe4a9255036b8120691aeda8
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-19T01:06:23Z

Small updates

commit cb66a4fdf23883c24a7684b6f20ca532c74241a7
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-19T18:08:14Z

Further refactoring. Moved outputLocs and associated functions to inside 
ShuffleMapStage.

commit e3e7ea23584d3fb442ede54231aa048815a8f436
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-19T18:11:15Z

Minor formatting

commit 149b9a13e48371a1ea06a3fc85374fa026368cd8
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-19T22:06:21Z

Refactored more extremely large methods to be more modular. Cleaned up some 
usages of vs nonEmpty.

commit be3e01fad73a3d1124ad090acbef6f38379afc4e
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-19T22:58:56Z

Minor fixes

commit 00182c7e7f9a38d5eee86065171804acc52e6d2a
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2015-02-20T07:43:59Z

Cleaned up hanleTaskCompletion more




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75201638
  
  [Test build #27771 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27771/consoleFull)
 for   PR 4703 at commit 
[`00182c7`](https://github.com/apache/spark/commit/00182c7e7f9a38d5eee86065171804acc52e6d2a).
 * This patch **fails RAT tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75201642
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27771/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

2015-02-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4703#issuecomment-75201629
  
  [Test build #27771 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27771/consoleFull)
 for   PR 4703 at commit 
[`00182c7`](https://github.com/apache/spark/commit/00182c7e7f9a38d5eee86065171804acc52e6d2a).
 * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org