[GitHub] spark pull request #16250: [CORE][MINOR] Stylistic changes in DAGScheduler (...
Github user jaceklaskowski closed the pull request at:

    https://github.com/apache/spark/pull/16250

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user shivaram commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16250#discussion_r92062052

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
         // (so we don't unnecessarily re-compute that data).
         val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
         val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-        (0 until locs.length).foreach { i =>
-          if (locs(i) ne null) {
-            // locs(i) will be null if missing
-            stage.addOutputLoc(i, locs(i))
+        locs.zipWithIndex
--- End diff --

It's hard to profile these things closely enough to figure out how much
overhead it adds (it depends on locs.length, of course). My vote would be to
be conservative and keep the existing version.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16250#discussion_r92072568

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
         // (so we don't unnecessarily re-compute that data).
         val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
         val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-        (0 until locs.length).foreach { i =>
-          if (locs(i) ne null) {
-            // locs(i) will be null if missing
-            stage.addOutputLoc(i, locs(i))
+        locs.zipWithIndex
--- End diff --

I'd be wary of making this type of change since there are cases where we've
purposely removed Scala-style transformations in hot scheduler loops.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16250#discussion_r91862235

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
         // (so we don't unnecessarily re-compute that data).
         val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
         val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-        (0 until locs.length).foreach { i =>
-          if (locs(i) ne null) {
-            // locs(i) will be null if missing
-            stage.addOutputLoc(i, locs(i))
+        locs.zipWithIndex
--- End diff --

Just FYI: this is going to be slower than the original code. I don't know if
it matters, but it will be slower.
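To illustrate the performance concern raised in the thread, here is a minimal, self-contained sketch of the two traversal styles being compared. `Loc` and the result buffers are hypothetical stand-ins for Spark's `MapStatus` and `Stage.addOutputLoc`; the point is that `zipWithIndex` and `filter` each allocate an intermediate collection of tuples, while the index-based loop allocates nothing extra.

```scala
// Sketch comparing the two equivalent traversals discussed above.
// `Loc` is a hypothetical stand-in for MapStatus; the buffers stand in
// for the side effect of stage.addOutputLoc. Nulls mark missing outputs.
object TraversalSketch {
  final case class Loc(host: String)

  def main(args: Array[String]): Unit = {
    val locs: Array[Loc] = Array(Loc("a"), null, Loc("c"))
    val viaIndex = scala.collection.mutable.Buffer.empty[(Int, Loc)]
    val viaZip   = scala.collection.mutable.Buffer.empty[(Int, Loc)]

    // Original style: a plain indexed loop, no intermediate allocations.
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        viaIndex += ((i, locs(i)))
      }
    }

    // Proposed style: zipWithIndex and filter each build a fresh
    // intermediate collection of tuples before foreach runs.
    locs.zipWithIndex
      .filter { case (status, _) => status != null }
      .foreach { case (status, idx) => viaZip += ((idx, status)) }

    // Both styles observe exactly the same (index, status) pairs.
    assert(viaIndex == viaZip)
    println(viaIndex.map { case (i, l) => s"$i->${l.host}" }.mkString(","))
  }
}
```

Whether the extra allocations matter depends on how hot the code path is, which is exactly the judgment call the reviewers are making here.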
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16250#discussion_r91860470

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
         // (so we don't unnecessarily re-compute that data).
         val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
         val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-        (0 until locs.length).foreach { i =>
-          if (locs(i) ne null) {
-            // locs(i) will be null if missing
-            stage.addOutputLoc(i, locs(i))
+        locs.zipWithIndex
+          .filter { case (status, _) => status != null } // null if missing
+          .foreach { case (ms, idx) =>
+            stage.addOutputLoc(idx, ms)
           }
-        }
       } else {
         // Kind of ugly: need to register RDDs with the cache and map output tracker here
         // since we can't do it in the RDD constructor because # of partitions is unknown
-        logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
+        logInfo(s"Registering RDD ${rdd.id} (partitions: ${rdd.partitions.length}, " +
--- End diff --

Yes, but that's because I think it's important at this "stage". Do you think
I should remove the partition-related addition?
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16250#discussion_r91857713

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1227,12 +1220,19 @@ class DAGScheduler(
             }
           }
           submitWaitingChildStages(shuffleStage)
+        } else {
+          // Some tasks had failed; let's resubmit this shuffleStage
+          // TODO: Lower-level scheduler should also deal with this
+          logInfo(s"Resubmitting $shuffleStage (${shuffleStage.name}) " +
+            "because some of its tasks had failed: " +
+            shuffleStage.findMissingPartitions().mkString(", "))
--- End diff --

If you bother with this, you could go all the way and pull this out as a val.
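A minimal sketch of what the "pull this out as a val" suggestion might look like. `logInfo`, `findMissingPartitions`, and the stage name below are hypothetical stand-ins for the `DAGScheduler` members the comment refers to, not the actual Spark code.

```scala
// Sketch of srowen's suggestion: hoist the formatted partition list
// into a named val before logging. logInfo and findMissingPartitions
// are hypothetical stand-ins for the real DAGScheduler members.
object ResubmitLogSketch {
  def logInfo(msg: => String): Unit = println(msg)

  def findMissingPartitions(): Seq[Int] = Seq(1, 3, 7)

  def main(args: Array[String]): Unit = {
    val stageName = "ShuffleMapStage 2"

    // Giving the intermediate expression a name keeps the logInfo call
    // short and makes the log statement easier to read.
    val missingPartitions = findMissingPartitions().mkString(", ")
    logInfo(s"Resubmitting $stageName because some of its tasks had failed: " +
      missingPartitions)
  }
}
```

This is purely a readability refactor; the logged message is unchanged.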
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16250#discussion_r91857696

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
         // (so we don't unnecessarily re-compute that data).
         val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
         val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-        (0 until locs.length).foreach { i =>
-          if (locs(i) ne null) {
-            // locs(i) will be null if missing
-            stage.addOutputLoc(i, locs(i))
+        locs.zipWithIndex
+          .filter { case (status, _) => status != null } // null if missing
+          .foreach { case (ms, idx) =>
--- End diff --

Use ms or status consistently?
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16250#discussion_r91857699

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -333,16 +333,16 @@ class DAGScheduler(
         // (so we don't unnecessarily re-compute that data).
         val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
         val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-        (0 until locs.length).foreach { i =>
-          if (locs(i) ne null) {
-            // locs(i) will be null if missing
-            stage.addOutputLoc(i, locs(i))
+        locs.zipWithIndex
+          .filter { case (status, _) => status != null } // null if missing
+          .foreach { case (ms, idx) =>
+            stage.addOutputLoc(idx, ms)
           }
-        }
       } else {
         // Kind of ugly: need to register RDDs with the cache and map output tracker here
         // since we can't do it in the RDD constructor because # of partitions is unknown
-        logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
+        logInfo(s"Registering RDD ${rdd.id} (partitions: ${rdd.partitions.length}, " +
--- End diff --

No big deal, but this is changing the log message too.
GitHub user jaceklaskowski opened a pull request:

    https://github.com/apache/spark/pull/16250

    [CORE][MINOR] Stylistic changes in DAGScheduler (to ease comprehension thereof)

## What changes were proposed in this pull request?

Stylistic changes in `DAGScheduler` to ease comprehension thereof.

## How was this patch tested?

Local build. Awaiting Jenkins to run the entire test suite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaceklaskowski/spark dagscheduler-minors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16250.patch

To close this pull request, make a commit to your master/trunk branch with
(at least) the following in the commit message:

    This closes #16250

commit bc39671cf6868f033b69179cbdc6e6c373904149
Author: Jacek Laskowski
Date:   2016-12-11T15:00:52Z

    [CORE][MINOR] Stylistic changes in DAGScheduler (to ease comprehension thereof)