[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50855896
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50856200
  
QA tests have started for PR 1418. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50859287
  
QA results for PR 1418:br- This patch FAILED unit tests.br- This patch 
merges cleanlybr- This patch adds no public classesbrbrFor more 
information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50915770
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50916408
  
QA tests have started for PR 1418. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17687/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50923486
  
Thanks Liang-Chi; I've merged this in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1418


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-31 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15648333
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -275,18 +287,53 @@ class DAGScheduler(
 case shufDep: ShuffleDependency[_, _, _] =
   parents += getShuffleMapStage(shufDep, jobId)
 case _ =
-  visit(dep.rdd)
+  waitingForVisit.push(dep.rdd)
   }
 }
   }
 }
-visit(rdd)
+waitingForVisit.push(rdd)
+while (!waitingForVisit.isEmpty) {
+  visit(waitingForVisit.pop())
+}
 parents.toList
   }
 
+  private def getParentShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
--- End diff --

Add new commit for several changes including function name, comments. 
Please review if it is fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15590707
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -275,18 +286,51 @@ class DAGScheduler(
 case shufDep: ShuffleDependency[_, _, _] =
   parents += getShuffleMapStage(shufDep, jobId)
 case _ =
-  visit(dep.rdd)
+  waitingForVisit.push(dep.rdd)
   }
 }
   }
 }
-visit(rdd)
+waitingForVisit.push(rdd)
+while (!waitingForVisit.isEmpty) {
+  visit(waitingForVisit.pop())
+}
 parents.toList
   }
 
+  private def getParentShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
+val parents = new Stack[ShuffleDependency[_, _, _]]
+val visited = new HashSet[RDD[_]]
+val waitingForVisit = new Stack[RDD[_]]
+def visit(r: RDD[_]) {
--- End diff --

I let the codes as the function `getParentShuffleDependencies` because it 
contains multiple indents and so put it under the `case` statement would not be 
readable. I can make it as inline if this is an issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15591064
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -265,6 +275,7 @@ class DAGScheduler(
   private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
 val parents = new HashSet[Stage]
 val visited = new HashSet[RDD[_]]
+val waitingForVisit = new Stack[RDD[_]]
--- End diff --

OK. I add a commit for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15592346
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -195,11 +195,21 @@ class DAGScheduler(
 shuffleToMapStage.get(shuffleDep.shuffleId) match {
   case Some(stage) = stage
   case None =
+val parentsWithNoMapStage = 
getParentShuffleDependencies(shuffleDep.rdd)
+while (!parentsWithNoMapStage.isEmpty) {
+  val currentShufDep = parentsWithNoMapStage.pop()
+  val stage =
+newOrUsedStage(
+  currentShufDep.rdd, currentShufDep.rdd.partitions.size, 
currentShufDep, jobId,
+  currentShufDep.rdd.creationSite)
+  shuffleToMapStage(currentShufDep.shuffleId) = stage
+}
--- End diff --

Yes. There is a recursive calling chain formed by `getShuffleMapStage` 
calling `newOrUsedStage` that calls `newStage` that calls `getParentStages` 
that calls `getShuffleMapStage` again. The calling chain is going to fill 
`shuffleToMapStage`  for parent map stages. Here I use an iterative approach 
instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-30 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15609799
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -195,11 +195,21 @@ class DAGScheduler(
 shuffleToMapStage.get(shuffleDep.shuffleId) match {
   case Some(stage) = stage
   case None =
+val parentsWithNoMapStage = 
getParentShuffleDependencies(shuffleDep.rdd)
+while (!parentsWithNoMapStage.isEmpty) {
+  val currentShufDep = parentsWithNoMapStage.pop()
+  val stage =
+newOrUsedStage(
+  currentShufDep.rdd, currentShufDep.rdd.partitions.size, 
currentShufDep, jobId,
+  currentShufDep.rdd.creationSite)
+  shuffleToMapStage(currentShufDep.shuffleId) = stage
+}
--- End diff --

Ah, I see it. Please add a comment on top that explains this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-30 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15609962
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -275,18 +287,53 @@ class DAGScheduler(
 case shufDep: ShuffleDependency[_, _, _] =
   parents += getShuffleMapStage(shufDep, jobId)
 case _ =
-  visit(dep.rdd)
+  waitingForVisit.push(dep.rdd)
   }
 }
   }
 }
-visit(rdd)
+waitingForVisit.push(rdd)
+while (!waitingForVisit.isEmpty) {
+  visit(waitingForVisit.pop())
+}
 parents.toList
   }
 
+  private def getParentShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
--- End diff --

This should actually be called getAncestorShuffleDependencies because it's 
not just direct parents, it can be grandparents and such as well.

Also, please add a comment at the top of this saying that what it finds. In 
particular it only finds missing ones.

Finally, would it be possible to merge this with the code that calls it in 
getShuffleMapStage, e.g. have a method called registerShuffleDependencies? 
Might be easier to follow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15558011
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -275,18 +286,51 @@ class DAGScheduler(
 case shufDep: ShuffleDependency[_, _, _] =
   parents += getShuffleMapStage(shufDep, jobId)
 case _ =
-  visit(dep.rdd)
+  waitingForVisit.push(dep.rdd)
   }
 }
   }
 }
-visit(rdd)
+waitingForVisit.push(rdd)
+while (!waitingForVisit.isEmpty) {
+  visit(waitingForVisit.pop())
+}
 parents.toList
   }
 
+  private def getParentShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
+val parents = new Stack[ShuffleDependency[_, _, _]]
+val visited = new HashSet[RDD[_]]
+val waitingForVisit = new Stack[RDD[_]]
+def visit(r: RDD[_]) {
--- End diff --

why define a function here? seems like this is only used once? why not just 
inline it in the while?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15563247
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -265,6 +275,7 @@ class DAGScheduler(
   private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
 val parents = new HashSet[Stage]
 val visited = new HashSet[RDD[_]]
+val waitingForVisit = new Stack[RDD[_]]
--- End diff --

Maybe add a comment that we are manually maintaining a stack to prevent 
StackOverflowError


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15563292
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -195,11 +195,21 @@ class DAGScheduler(
 shuffleToMapStage.get(shuffleDep.shuffleId) match {
   case Some(stage) = stage
   case None =
+val parentsWithNoMapStage = 
getParentShuffleDependencies(shuffleDep.rdd)
+while (!parentsWithNoMapStage.isEmpty) {
+  val currentShufDep = parentsWithNoMapStage.pop()
+  val stage =
+newOrUsedStage(
+  currentShufDep.rdd, currentShufDep.rdd.partitions.size, 
currentShufDep, jobId,
+  currentShufDep.rdd.creationSite)
+  shuffleToMapStage(currentShufDep.shuffleId) = stage
+}
--- End diff --

Maybe I'm missing it, but this seems like an unrelated change. Is this 
needed by the non-recursive visit code?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15563334
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -265,6 +275,7 @@ class DAGScheduler(
   private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
 val parents = new HashSet[Stage]
 val visited = new HashSet[RDD[_]]
+val waitingForVisit = new Stack[RDD[_]]
--- End diff --

(Same on the other methods that use a stack)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-28 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50428405
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50428730
  
QA tests have started for PR 1418. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17331/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50431125
  
QA results for PR 1418:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds no public classesbrbrFor more 
information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17331/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-27 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50257662
  
@mateiz Thanks for suggestion. I leave the PageRank example as-is. These 
braces are added to comply with code style.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-27 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50289064
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50225517
  
Thanks for commenting. How about the review?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15436168
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala ---
@@ -51,6 +55,11 @@ object SparkPageRank {
 urls.map(url = (url, rank / size))
   }
   ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
+  if (i % 50 == 0) {
+ranks.cache()
+ranks.checkpoint()
+ranks.collect()
--- End diff --

Don't do a collect, if you want to force it to be computed, just do a 
`foreach`. Otherwise `collect` will try to bring all the data back to the 
driver.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50250201
  
QA tests have started for PR 1418. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17228/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50250223
  
Hey @viirya, instead of modifying the PageRank example, what do you think 
of leaving it as-is until we have automatic checkpointing of long lineage 
chains? I think that will be better because the example is meant to be easy to 
understand and not that many people will run PageRank with hundreds of 
iterations (this particular algorithm usually converges much faster).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50250213
  
QA results for PR 1418:br- This patch FAILED unit tests.br- This patch 
merges cleanlybr- This patch adds no public classesbrbrFor more 
information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17228/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50250242
  
BTW the Jenkins failure is due to a code style issue: an if block without 
braces

Jenkins, this is ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1418#discussion_r15436180
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -275,18 +286,48 @@ class DAGScheduler(
 case shufDep: ShuffleDependency[_, _, _] =
   parents += getShuffleMapStage(shufDep, jobId)
 case _ =
-  visit(dep.rdd)
+  waitingForVisit.push(dep.rdd)
   }
 }
   }
 }
-visit(rdd)
+waitingForVisit.push(rdd)
+while (!waitingForVisit.isEmpty)
+  visit(waitingForVisit.pop())
 parents.toList
   }
 
+  private def getParentShuffleDependencies(rdd: RDD[_]): 
Stack[ShuffleDependency[_, _, _]] = {
+val parents = new Stack[ShuffleDependency[_, _, _]]
+val visited = new HashSet[RDD[_]]
+val waitingForVisit = new Stack[RDD[_]]
+def visit(r: RDD[_]) {
+  if (!visited(r)) {
+visited += r
+for (dep - r.dependencies) {
+  dep match {
+case shufDep: ShuffleDependency[_, _, _] =
+  if (!shuffleToMapStage.contains(shufDep.shuffleId))
--- End diff --

This should have braces around the if statement's body


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-26 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-50250276
  
BTW you can run sbt scalastyle to check these style things locally


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-22 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-49773532
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-49540404
  
Thanks for submitting this. I think we can still stack overflow in 
serialization, but I agree it's better to do this non-recursivley. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-49540416
  
Actually it's late. I will review this tomorrow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-16 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-49184593
  
Another example of this problem is the PageRank example bundled in Spark. 
At this time, since the problem of Java serializer still exists, to avoid 
causing StackOverflowError after too many iterations, it is needed to call 
`checkpoint()` on the RDD.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-15 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/1418

[SPARK-2490] Change recursive visiting on RDD dependencies to iterative 
approach


When performing some transformations on RDDs after many iterations, the 
dependencies of RDDs could be very long. It can easily cause StackOverflowError 
when recursively visiting these dependencies in Spark core. For example:

var rdd = sc.makeRDD(Array(1))
for (i - 1 to 1000) { 
  rdd = rdd.coalesce(1).cache()
  rdd.collect()
}

This PR changes recursive visiting on rdd's dependencies to iterative 
approach to avoid StackOverflowError. 

In addition to the recursive visiting, since the Java serializer has a 
known [bug](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4152790) that 
causes StackOverflowError too when serializing/deserializing a large graph of 
objects. So applying this PR only solves part of the problem. Using 
KryoSerializer to replace Java serializer might be helpful. However, since 
KryoSerializer is not supported for `spark.closure.serializer` now, I can not 
test if KryoSerializer can solve Java serializer's problem completely. 






You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 remove_recursive_visit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1418.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1418


commit 900538bbcb61683bf1418534c2466463a630569f
Author: Liang-Chi Hsieh vii...@gmail.com
Date:   2014-07-15T10:58:45Z

change recursive visiting on rdd's dependencies to iterative approach to 
avoid stackoverflowerror.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-49022775
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---