[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11327 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216658381 Merging this into master, thanks!
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216647234 Test build #2965 has finished successfully so I'm going by that. The other one had another unit test failure that is unrelated.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216646855 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57653/ Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216646851 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216646622
**[Test build #57653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57653/consoleFull)** for PR 11327 at commit [`305a7db`](https://github.com/apache/spark/commit/305a7db60a2fc836035ed06c8207ce772c5e3b23).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216644889
**[Test build #2965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2965/consoleFull)** for PR 11327 at commit [`305a7db`](https://github.com/apache/spark/commit/305a7db60a2fc836035ed06c8207ce772c5e3b23).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216621319 **[Test build #57653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57653/consoleFull)** for PR 11327 at commit [`305a7db`](https://github.com/apache/spark/commit/305a7db60a2fc836035ed06c8207ce772c5e3b23).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216620528 test failure is in ExternalAppendOnlyMapSuite which is unrelated. I'll kick jenkins again.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216620552 Jenkins, test this please
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216613204 **[Test build #2965 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2965/consoleFull)** for PR 11327 at commit [`305a7db`](https://github.com/apache/spark/commit/305a7db60a2fc836035ed06c8207ce772c5e3b23).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216612214 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216612218 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57644/ Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216611966
**[Test build #57644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57644/consoleFull)** for PR 11327 at commit [`305a7db`](https://github.com/apache/spark/commit/305a7db60a2fc836035ed06c8207ce772c5e3b23).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216603791 LGTM, pending tests. It's great to have a 200X speedup, thanks!
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216585173 **[Test build #57644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57644/consoleFull)** for PR 11327 at commit [`305a7db`](https://github.com/apache/spark/commit/305a7db60a2fc836035ed06c8207ce772c5e3b23).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216584419 thanks for the review, made changes and updated description.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216358709 @tgravescs Could you also update the description to reflect the new changes?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r61801165
--- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala ---
@@ -334,8 +332,41 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
         }
       }
     } else {
+      // It is possible to have unionRDD where one rdd has preferred locations and another rdd
+      // that doesn't. To make sure we end up with the requested number of partitions,
+      // make sure to put a partition in every group.
+
+      if (groupArr.size > initialHash.size) {
+        // we don't have a partition assigned to every group yet so first try to fill them
+        // with the partitions with preferred locations
+        val partIter = partitionLocs.partsWithLocs.iterator
+        while (partIter.hasNext && initialHash.size < groupArr.size) {
+          var (nxt_replica, nxt_part) = partIter.next()
+          if (!initialHash.contains(nxt_part)) {
+            groupArr.find(pg => pg.numPartitions == 0).map(firstEmpty => {
--- End diff --
same here
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r61801132
--- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala ---
@@ -334,8 +332,41 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
         }
       }
     } else {
+      // It is possible to have unionRDD where one rdd has preferred locations and another rdd
+      // that doesn't. To make sure we end up with the requested number of partitions,
+      // make sure to put a partition in every group.
+
+      if (groupArr.size > initialHash.size) {
+        // we don't have a partition assigned to every group yet so first try to fill them
+        // with the partitions with preferred locations
+        val partIter = partitionLocs.partsWithLocs.iterator
+        while (partIter.hasNext && initialHash.size < groupArr.size) {
+          var (nxt_replica, nxt_part) = partIter.next()
+          if (!initialHash.contains(nxt_part)) {
+            groupArr.find(pg => pg.numPartitions == 0).map(firstEmpty => {
+              firstEmpty.partitions += nxt_part
+              initialHash += nxt_part
+            })
+          }
+        }
+      }
+
+      // if we didn't get one partitions per group from partitions with preferred locations
+      // use partitions without preferred locations
+      val partNoLocIter = partitionLocs.partsWithoutLocs.iterator
+      while (partNoLocIter.hasNext && initialHash.size < groupArr.size) {
+        var nxt_part = partNoLocIter.next()
+        if (!initialHash.contains(nxt_part)) {
+          groupArr.find(pg => pg.numPartitions == 0).map(firstEmpty => {
--- End diff --
This is still O(N*N) in the worst case. It could be:
```
groupArr.filter(pg => pg.numPartitions == 0).foreach { pg =>
  while (partNoLocIter.hasNext && pg.numPartitions == 0) {
    val nxt_part = partNoLocIter.next()
    if (!initialHash.contains(nxt_part)) {
      pg.partitions += nxt_part
      initialHash += nxt_part
    }
  }
}
```
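For readers following the thread, the single-pass idea in the review comment above can be sketched outside Spark. This is a minimal illustration under stated assumptions, not the code that was merged: `FillEmptyGroups`, `fill`, and the use of plain `Int` ids with `ArrayBuffer[Int]` groups are hypothetical stand-ins for `groupArr`, `PartitionGroup`, and `initialHash`.

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object FillEmptyGroups {
  // Gives each empty group the next unassigned partition id. The candidate
  // iterator is shared across groups, so each candidate is examined at most
  // once and the total work is linear in groups plus candidates.
  def fill(
      groups: IndexedSeq[ArrayBuffer[Int]],
      candidates: Iterator[Int],
      assigned: mutable.Set[Int]): Unit = {
    groups.filter(_.isEmpty).foreach { group =>
      while (candidates.hasNext && group.isEmpty) {
        val next = candidates.next()
        if (!assigned.contains(next)) {
          group += next
          assigned += next
        }
      }
    }
  }
}
```

The difference from the `find`-based version in the diff is that the empty groups are selected once up front, rather than rescanning the whole group array for every candidate partition.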
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r61798932
--- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala ---
@@ -320,7 +317,8 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
     }
   }
-  def throwBalls(maxPartitions: Int, prev: RDD[_], balanceSlack: Double) {
+  def throwBalls(maxPartitions: Int, prev: RDD[_],
+      balanceSlack: Double, partitionLocs: PartitionLocations) {
--- End diff --
indents
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r61798850
--- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala ---
@@ -289,10 +284,12 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
    * imbalance in favor of locality
    * @return partition group (bin to be put in)
    */
-  def pickBin(p: Partition, prev: RDD[_], balanceSlack: Double): PartitionGroup = {
+  def pickBin(p: Partition, prev: RDD[_], balanceSlack: Double,
+      partitionLocs: PartitionLocations): PartitionGroup = {
--- End diff --
```
def pickBin(
    p: Partition,
    prev: RDD[_],
    balanceSlack: Double,
    partitionLocs: PartitionLocations): PartitionGroup = {
```
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216353266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57556/ Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216353258
**[Test build #57556 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57556/consoleFull)** for PR 11327 at commit [`f012cd5`](https://github.com/apache/spark/commit/f012cd5fc20feb20088c808275cf283d0f594cec).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216353264 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216352965 **[Test build #57556 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57556/consoleFull)** for PR 11327 at commit [`f012cd5`](https://github.com/apache/spark/commit/f012cd5fc20feb20088c808275cf283d0f594cec).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r61798109
--- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala ---
@@ -169,43 +169,41 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
   var noLocality = true // if true if no preferredLocations exists for parent RDD

-  // gets the *current* preferred locations from the DAGScheduler (as opposed to the static ones)
-  def currPrefLocs(part: Partition, prev: RDD[_]): Seq[String] = {
-    prev.context.getPreferredLocs(prev, part.index).map(tl => tl.host)
-  }
-
-  // this class just keeps iterating and rotating infinitely over the partitions of the RDD
-  // next() returns the next preferred machine that a partition is replicated on
-  // the rotator first goes through the first replica copy of each partition, then second, third
-  // the iterators return type is a tuple: (replicaString, partition)
-  class LocationIterator(prev: RDD[_]) extends Iterator[(String, Partition)] {
-
-    var it: Iterator[(String, Partition)] = resetIterator()
-
-    override val isEmpty = !it.hasNext
-
-    // initializes/resets to start iterating from the beginning
-    def resetIterator(): Iterator[(String, Partition)] = {
-      val iterators = (0 to 2).map { x =>
-        prev.partitions.iterator.flatMap { p =>
-          if (currPrefLocs(p, prev).size > x) Some((currPrefLocs(p, prev)(x), p)) else None
+  class PartitionLocations(prev: RDD[_]) {
+
+    // contains all the partitions from the previous RDD that don't have preferred locations
+    val partsWithoutLocs = ArrayBuffer[Partition]()
+    // contains all the partitions from the previous RDD that have preferred locations
+    val partsWithLocs: Array[(String, Partition)] = getAllPrefLocs(prev)
+
+    // has side affect of filling in partitions without locations as well
+    def getAllPrefLocs(prev: RDD[_]): Array[(String, Partition)] = {
+      val partsWithLocs = mutable.LinkedHashMap[Partition, Seq[String]]()
+      // first get the locations for each partition, only do this once since it can be expensive
+      prev.partitions.foreach(p => {
+          val locs = currPrefLocs(p, prev)
+          if (locs.size > 0) {
+            partsWithLocs.put(p, locs)
+          } else {
+            partsWithoutLocs += p
+          }
         }
-      }
-      iterators.reduceLeft((x, y) => x ++ y)
+      )
+      // convert it into an array of host to partition
+      val allLocs = (0 to 2).map(x =>
+        partsWithLocs.toArray.flatMap(parts => {
+          val p = parts._1
+          val locs = parts._2
+          if (locs.size > x) Some((locs(x), p)) else None
+        } )
+      )
+      allLocs.reduceLeft((x, y) => x ++ y)
     }
+  }

-    // hasNext() is false iff there are no preferredLocations for any of the partitions of the RDD
-    override def hasNext: Boolean = { !isEmpty }
-
-    // return the next preferredLocation of some partition of the RDD
-    override def next(): (String, Partition) = {
-      if (it.hasNext) {
-        it.next()
-      } else {
-        it = resetIterator() // ran out of preferred locations, reset and rotate to the beginning
-        it.next()
-      }
-    }
+  // gets the *current* preferred locations from the DAGScheduler (as opposed to the static ones)
+  def currPrefLocs(part: Partition, prev: RDD[_]): Seq[String] = {
--- End diff --
private or inline this?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r61797836
--- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala ---
@@ -169,43 +169,41 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
   var noLocality = true // if true if no preferredLocations exists for parent RDD

-  // gets the *current* preferred locations from the DAGScheduler (as opposed to the static ones)
-  def currPrefLocs(part: Partition, prev: RDD[_]): Seq[String] = {
-    prev.context.getPreferredLocs(prev, part.index).map(tl => tl.host)
-  }
-
-  // this class just keeps iterating and rotating infinitely over the partitions of the RDD
-  // next() returns the next preferred machine that a partition is replicated on
-  // the rotator first goes through the first replica copy of each partition, then second, third
-  // the iterators return type is a tuple: (replicaString, partition)
-  class LocationIterator(prev: RDD[_]) extends Iterator[(String, Partition)] {
-
-    var it: Iterator[(String, Partition)] = resetIterator()
-
-    override val isEmpty = !it.hasNext
-
-    // initializes/resets to start iterating from the beginning
-    def resetIterator(): Iterator[(String, Partition)] = {
-      val iterators = (0 to 2).map { x =>
-        prev.partitions.iterator.flatMap { p =>
-          if (currPrefLocs(p, prev).size > x) Some((currPrefLocs(p, prev)(x), p)) else None
+  class PartitionLocations(prev: RDD[_]) {
+
+    // contains all the partitions from the previous RDD that don't have preferred locations
+    val partsWithoutLocs = ArrayBuffer[Partition]()
+    // contains all the partitions from the previous RDD that have preferred locations
+    val partsWithLocs: Array[(String, Partition)] = getAllPrefLocs(prev)
+
+    // has side affect of filling in partitions without locations as well
+    def getAllPrefLocs(prev: RDD[_]): Array[(String, Partition)] = {
+      val partsWithLocs = mutable.LinkedHashMap[Partition, Seq[String]]()
+      // first get the locations for each partition, only do this once since it can be expensive
+      prev.partitions.foreach(p => {
+          val locs = currPrefLocs(p, prev)
+          if (locs.size > 0) {
+            partsWithLocs.put(p, locs)
+          } else {
+            partsWithoutLocs += p
+          }
         }
-      }
-      iterators.reduceLeft((x, y) => x ++ y)
+      )
+      // convert it into an array of host to partition
+      val allLocs = (0 to 2).map(x =>
+        partsWithLocs.toArray.flatMap(parts => {
+          val p = parts._1
+          val locs = parts._2
+          if (locs.size > x) Some((locs(x), p)) else None
--- End diff --
We may just append (locs(x), p) to an ArrayBuffer
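The suggestion above to append `(locs(x), p)` to an `ArrayBuffer` could look roughly like the sketch below. It is an illustration only, written without Spark: `ReplicaOrder`, `flatten`, the `maxReplicas` parameter, and the use of `Int` partition ids are assumptions made for the example, not names from the PR.

```scala
import scala.collection.mutable.ArrayBuffer

object ReplicaOrder {
  // Emits (host, partitionId) pairs ordered by replica rank: every partition's
  // first preferred host, then every second one, then every third. Pairs are
  // appended to one buffer instead of building and concatenating
  // intermediate arrays.
  def flatten(
      locsByPart: Seq[(Int, Seq[String])],
      maxReplicas: Int = 3): Array[(String, Int)] = {
    val out = ArrayBuffer[(String, Int)]()
    for (rank <- 0 until maxReplicas) {
      locsByPart.foreach { case (part, locs) =>
        if (locs.size > rank) out += ((locs(rank), part))
      }
    }
    out.toArray
  }
}
```

The ordering matches the `(0 to 2).map(...)` followed by `reduceLeft` in the diff, but avoids the `Option` wrapping and the repeated array concatenation.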
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216350597 @tgravescs That's great, could you fix the style?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216339716
**[Test build #57550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57550/consoleFull)** for PR 11327 at commit [`2eff583`](https://github.com/apache/spark/commit/2eff583d896b1032477a299aa9ae488711d5f01c).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * ` class PartitionLocations(prev: RDD[_]) `
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216339722 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57550/ Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216339720 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216339359 Ok, I replaced the location iterator and now get all the preferred locations up front. This made the run time of this go from around a minute down to around 6 seconds. I kept most of the logic the same and just changed how it's getting the locations.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-216339363 **[Test build #57550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57550/consoleFull)** for PR 11327 at commit [`2eff583`](https://github.com/apache/spark/commit/2eff583d896b1032477a299aa9ae488711d5f01c).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-200963748 I think the current implementation does not handle location changing, and we can't.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-200961327 Thanks for the feedback. I'm fine with that change and had actually considered it; I just wasn't sure whether the location iterator was intended to handle the locations changing (in some case I'm not aware of), so I went with the less invasive approach of leaving that part the same. If we aren't aware of any issues with that, I'll make the changes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-200959288 @tgravescs The current change is good for your case, but I'm thinking that maybe we could do better. The LocationIterator has a bad smell: it may call getPreferredLocs() many times on the same partition, which could be expensive, as you mentioned. We should call getPreferredLocs() on each partition only once, either by caching its result or by getting rid of LocationIterator entirely. Since we will call getPreferredLocs on every partition of the previous RDD anyway, we could call it eagerly at the beginning, split the partitions into two groups (with preferred locations and without), and handle each group differently. This approach should cover all the cases: 1) all partitions have locs, 2) some partitions have locs, 3) no partitions have locs. Hopefully, this could be done in less than 10 seconds in your case.
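The eager approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was merged: `Part`, `EagerLocations` and `splitByLocality` are hypothetical names, and `prefLocs` stands in for whatever lookup the coalescer uses to get a partition's preferred locations.

```scala
object EagerLocations {
  // Hypothetical stand-in for Spark's Partition.
  final case class Part(index: Int)

  // Calls `prefLocs` exactly once per partition and splits the partitions
  // into those with at least one preferred location and those with none.
  def splitByLocality(
      parts: Seq[Part],
      prefLocs: Part => Seq[String]): (Map[Part, Seq[String]], Seq[Part]) = {
    // One lookup per partition; the result is reused for both groups.
    val resolved = parts.map(p => (p, prefLocs(p)))
    val (withLocs, withoutLocs) = resolved.partition { case (_, locs) => locs.nonEmpty }
    (withLocs.toMap, withoutLocs.map(_._1))
  }
}
```

With both groups in hand, the three cases listed in the comment (all, some, or none of the partitions have locations) reduce to whether either collection is empty.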
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r57363138 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -324,6 +319,40 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: } } } else { + // It is possible to have unionRDD where one rdd has preferred locations and another rdd + // that doesn't. To make sure we end up with the requested number of partitions, + // make sure to put a partitions in every group. + + if (groupArr.size > initialHash.size) { +// we don't have a partition assigned to every group yet so first try to fill them +// with the partitions with preferred locations +var tries = 0 +val rotIt = new LocationIterator(prev) +while (tries < prev.partitions.length && initialHash.size < groupArr.size) { + // if the number of partitions with preferred locations is less then + // number of total partitions this might loop over some more then once but we need to + // handle both cases and its not easy to get # of partitions with preferred locs + var (nxt_replica, nxt_part) = rotIt.next() + if (!initialHash.contains(nxt_part)) { +groupArr.find(pg => pg.size == 0).map(firstEmpty => { + firstEmpty.arr += nxt_part + initialHash += nxt_part +}) + } + tries += 1 +} + } + // we have went through all with preferred locations now just make sure one + // partition per group + val numEmptyPartitionGroups = groupArr.length - getPartitions.length --- End diff -- I split it this way because one is called setupGroups and the throwBalls one is for putting stuff in the groups. I see the fact that we put stuff in them in setupGroups as an optimization rather than a necessity. Do you see a benefit to doing it there vs. here?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r57361560 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -324,6 +319,40 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: } } } else { + // It is possible to have unionRDD where one rdd has preferred locations and another rdd + // that doesn't. To make sure we end up with the requested number of partitions, + // make sure to put a partitions in every group. + + if (groupArr.size > initialHash.size) { +// we don't have a partition assigned to every group yet so first try to fill them +// with the partitions with preferred locations +var tries = 0 +val rotIt = new LocationIterator(prev) +while (tries < prev.partitions.length && initialHash.size < groupArr.size) { + // if the number of partitions with preferred locations is less then + // number of total partitions this might loop over some more then once but we need to + // handle both cases and its not easy to get # of partitions with preferred locs + var (nxt_replica, nxt_part) = rotIt.next() + if (!initialHash.contains(nxt_part)) { +groupArr.find(pg => pg.size == 0).map(firstEmpty => { --- End diff -- This is also O(N*N)
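The O(N*N) concern arises because `groupArr.find(pg => pg.size == 0)` rescans the groups from the start on every loop iteration. One way to avoid the rescans, sketched here under assumptions rather than taken from the PR, is to collect the empty groups once and pair them with the unassigned partitions. `Group`, `FillEmptyGroups` and `fill` are hypothetical names, and partitions are reduced to plain integer ids.

```scala
import scala.collection.mutable.ArrayBuffer

object FillEmptyGroups {
  // Hypothetical stand-in for a partition group: a growable list of partition ids.
  final class Group { val arr: ArrayBuffer[Int] = ArrayBuffer.empty[Int] }

  // Gives each empty group one unassigned partition, pairing the two lists
  // positionally so that neither is rescanned. Returns how many were placed.
  def fill(groups: Seq[Group], unassigned: Seq[Int]): Int = {
    val empties = groups.filter(_.arr.isEmpty) // single scan of the groups
    val pairs = empties.zip(unassigned)        // stops at the shorter side
    pairs.foreach { case (g, p) => g.arr += p }
    pairs.length
  }
}
```

The work is then linear in the number of groups plus the number of partitions placed, instead of one scan of the groups per placement.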
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r57361348 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -324,6 +319,40 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: } } } else { + // It is possible to have unionRDD where one rdd has preferred locations and another rdd + // that doesn't. To make sure we end up with the requested number of partitions, + // make sure to put a partitions in every group. + + if (groupArr.size > initialHash.size) { +// we don't have a partition assigned to every group yet so first try to fill them +// with the partitions with preferred locations +var tries = 0 +val rotIt = new LocationIterator(prev) +while (tries < prev.partitions.length && initialHash.size < groupArr.size) { + // if the number of partitions with preferred locations is less then + // number of total partitions this might loop over some more then once but we need to + // handle both cases and its not easy to get # of partitions with preferred locs + var (nxt_replica, nxt_part) = rotIt.next() + if (!initialHash.contains(nxt_part)) { +groupArr.find(pg => pg.size == 0).map(firstEmpty => { + firstEmpty.arr += nxt_part + initialHash += nxt_part +}) + } + tries += 1 +} + } + // we have went through all with preferred locations now just make sure one + // partition per group + val numEmptyPartitionGroups = groupArr.length - getPartitions.length --- End diff -- Should we move this into setupGroups(), so we still meet the assumption that all the groups have one partition?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r57361377 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -324,6 +319,40 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: } } } else { + // It is possible to have unionRDD where one rdd has preferred locations and another rdd + // that doesn't. To make sure we end up with the requested number of partitions, + // make sure to put a partitions in every group. + + if (groupArr.size > initialHash.size) { +// we don't have a partition assigned to every group yet so first try to fill them +// with the partitions with preferred locations +var tries = 0 +val rotIt = new LocationIterator(prev) +while (tries < prev.partitions.length && initialHash.size < groupArr.size) { + // if the number of partitions with preferred locations is less then + // number of total partitions this might loop over some more then once but we need to + // handle both cases and its not easy to get # of partitions with preferred locs + var (nxt_replica, nxt_part) = rotIt.next() + if (!initialHash.contains(nxt_part)) { +groupArr.find(pg => pg.size == 0).map(firstEmpty => { + firstEmpty.arr += nxt_part + initialHash += nxt_part +}) + } + tries += 1 +} + } + // we have went through all with preferred locations now just make sure one + // partition per group + val numEmptyPartitionGroups = groupArr.length - getPartitions.length + val partitionsNotInGroups = prev.partitions.filter(p => !initialHash.contains(p)) + for (i <- 0 until math.min(numEmptyPartitionGroups, partitionsNotInGroups.length)) { +groupArr.find(pg => pg.size == 0).map(firstEmpty => { --- End diff -- This is still O(N*N)
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-200929309 @tgravescs I'm reviewing this now, sorry for the delay.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-200904458 ping @davies @rxin Any chance I can get review on this?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-193340890 @tgravescs I have not had enough time to look into the details yet (I'm not familiar with this part), sorry for the delay.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-193293645 ping @davies was there any other concern or does this look good?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r54446793 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -192,7 +192,8 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: def resetIterator(): Iterator[(String, Partition)] = { val iterators = (0 to 2).map( x => prev.partitions.iterator.flatMap(p => { - if (currPrefLocs(p).size > x) Some((currPrefLocs(p)(x), p)) else None + val locs = currPrefLocs(p) + if (locs.size > x) Some((locs(x), p)) else None --- End diff -- I was just responding to @davies' comment. I'm not saying there is anything wrong with this.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-190241612 Are there any other comments about functionality?
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r54418281 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -192,7 +192,8 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: def resetIterator(): Iterator[(String, Partition)] = { val iterators = (0 to 2).map( x => prev.partitions.iterator.flatMap(p => { - if (currPrefLocs(p).size > x) Some((currPrefLocs(p)(x), p)) else None + val locs = currPrefLocs(p) + if (locs.size > x) Some((locs(x), p)) else None --- End diff -- @rxin I don't follow your first sentence. The second sentence says to use size for a collection, and Seq is a collection.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r54375032 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -192,7 +192,8 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: def resetIterator(): Iterator[(String, Partition)] = { val iterators = (0 to 2).map( x => prev.partitions.iterator.flatMap(p => { - if (currPrefLocs(p).size > x) Some((currPrefLocs(p)(x), p)) else None + val locs = currPrefLocs(p) + if (locs.size > x) Some((locs(x), p)) else None --- End diff -- It's not size vs. length. It's that seq.size can sometimes be O(n). For size vs. length, we should use length if it is a string or an array, but size if it is a collection.
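If the cost of `size` on a linear `Seq` such as `List` is the worry, one option not proposed in the thread is the standard library's `Seq.lengthCompare`, which only walks as far as it needs to. The sketch below assumes a non-negative index `x`; `SeqSizeCheck`, `hasIndex` and `locAt` are hypothetical names.

```scala
object SeqSizeCheck {
  // True when `locs` has an element at index `x` (x is assumed non-negative).
  // For a linear Seq such as List, lengthCompare stops after roughly x + 1
  // cells instead of traversing the whole list as size or length would.
  def hasIndex(locs: Seq[String], x: Int): Boolean =
    locs.lengthCompare(x) > 0

  // Returns the x-th location if present, mirroring the check in the diff.
  def locAt(locs: Seq[String], x: Int): Option[String] =
    if (hasIndex(locs, x)) Some(locs(x)) else None
}
```

With at most three replica ranks checked per partition the difference is likely small; the sketch only shows the bounded check.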
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r54115176 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -192,7 +192,8 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: def resetIterator(): Iterator[(String, Partition)] = { val iterators = (0 to 2).map( x => prev.partitions.iterator.flatMap(p => { - if (currPrefLocs(p).size > x) Some((currPrefLocs(p)(x), p)) else None + val locs = currPrefLocs(p) + if (locs.size > x) Some((locs(x), p)) else None --- End diff -- Are you sure about this? Overall I'm fine with changing it, but I just want to understand for the future. locs here is a Seq. From the scaladoc for size: "The size of this sequence, equivalent to length". The scaladoc for length on a sequence also says: "Note: will not terminate for infinite-sized collections." Looking at the Scala source code: https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/SeqLike.scala#L106
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r54116119 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -192,7 +192,8 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: def resetIterator(): Iterator[(String, Partition)] = { val iterators = (0 to 2).map( x => prev.partitions.iterator.flatMap(p => { - if (currPrefLocs(p).size > x) Some((currPrefLocs(p)(x), p)) else None + val locs = currPrefLocs(p) + if (locs.size > x) Some((locs(x), p)) else None --- End diff -- cc @rxin
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/11327#discussion_r54019069 --- Diff: core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala --- @@ -192,7 +192,8 @@ private class PartitionCoalescer(maxPartitions: Int, prev: RDD[_], balanceSlack: def resetIterator(): Iterator[(String, Partition)] = { val iterators = (0 to 2).map( x => prev.partitions.iterator.flatMap(p => { - if (currPrefLocs(p).size > x) Some((currPrefLocs(p)(x), p)) else None + val locs = currPrefLocs(p) + if (locs.size > x) Some((locs(x), p)) else None --- End diff -- locs.size -> locs.length. `size` could be O(N), while `length` is O(1).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-188142561 cc @davies
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187969886 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187969889 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51799/ Test PASSed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187969631 **[Test build #51799 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51799/consoleFull)** for PR 11327 at commit [`8665114`](https://github.com/apache/spark/commit/86651146e90d5126d379a4abc6d73a8c6b7a50df). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187921218 **[Test build #51799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51799/consoleFull)** for PR 11327 at commit [`8665114`](https://github.com/apache/spark/commit/86651146e90d5126d379a4abc6d73a8c6b7a50df).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187891776 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187891779 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51789/ Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187891764 **[Test build #51789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51789/consoleFull)** for PR 11327 at commit [`c9eb032`](https://github.com/apache/spark/commit/c9eb032af8e453a5ba6776279cf0cd6946d0cd55). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187891331 **[Test build #51789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51789/consoleFull)** for PR 11327 at commit [`c9eb032`](https://github.com/apache/spark/commit/c9eb032af8e453a5ba6776279cf0cd6946d0cd55).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187885309 Jenkins, test this please
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187883014 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187883016 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51787/ Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187875082 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187875090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51786/ Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187875072 **[Test build #51786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51786/consoleFull)** for PR 11327 at commit [`afe14dc`](https://github.com/apache/spark/commit/afe14dce508b1e51820f16e33f09c9aa402bca3e). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187874407 **[Test build #51786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51786/consoleFull)** for PR 11327 at commit [`afe14dc`](https://github.com/apache/spark/commit/afe14dce508b1e51820f16e33f09c9aa402bca3e).
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187858375 Jenkins, test this please
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187851796 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11316] coalesce doesn't handle UnionRDD...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11327#issuecomment-187851798 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51784/ Test FAILed.