[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50213171 QA results for PR 1499: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class ShuffledRDD[K, V, C](

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50210371 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17202/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread markhamstra
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50160338 After installing `hub` you can also do a bunch of new stuff on the command line, including `hub checkout https://github.com/apache/spark/pull/1499` https://hu

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread aarondav
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50159041 Not sure if it helps, but you can get PRs via something like `git fetch apache refs/pull/1499/head`. Easier workflow, perhaps. On Jul 25, 2014 12:25 AM, "Mridul Mu

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50116492 Ah, thanks! Rerunning with 9c29957. I can't pull the PR, and a manual merge is painful, hence the delays in testing :-)

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15388684 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50115862 @mridulm that should've been fixed recently in https://github.com/mateiz/spark/commit/9c299579f13f004f5fd1f4dd0b98b7d76cac2a55, which got rid of custom return types in Shu

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50115453 BTW, this is one of 5 failures from core. I hope there are no merge issues, though.

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-25 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-50115353 Running tests with `export SPARK_JAVA_OPTS="-Dspark.shuffle.manager=org.apache.spark.shuffle.sort.SortShuffleManager"` causes: ''' - sorting using mutable pai

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15383498 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15382131 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-24 Thread lianhuiwang
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15352745 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-24 Thread colorant
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15333148 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49973755 Added one more commit that fixes the type of ShuffledRDD, because in this new shuffle it's not possible to return a custom Product2 the way it's written now, and in the ol

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15332387 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread colorant
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15331341 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49961533 QA results for PR 1499: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.c

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49956204 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17078/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15324369 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala --- @@ -120,8 +124,10 @@ class ExternalAppendOnlyMap[K, V, C](

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49949874 Let me know if I've missed something. That combining part happens in mergeWithAggregation. Actually for many operations map-side combine is disabled so we don't even do it

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49949660 No, we only read one value at a time. If you're doing something like groupByKey, the combine function may create an ArrayBuffer or stuff like that, but that's only for the

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49949511 @mateiz The total memory overhead actually goes much higher than num_streams, right? It should be on the order of num_streams + num_values for this key. For fairly l

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49938524 QA results for PR 1499: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.c

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49925284 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17055/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49924823 @colorant @mridulm for the serializing part, one issue is that as we merge streams we already need a heap of O(num streams) objects, so if you're worried about the objects
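
The "heap of O(num streams) objects" mentioned above is the usual cost of a k-way merge: one deserialized record per spilled stream sits in a priority queue at any time. A minimal sketch of that idea, written in Python rather than Scala for brevity; `merge_sorted_streams` and the sample data are hypothetical and are not taken from ExternalSorter:

```python
import heapq

def merge_sorted_streams(streams):
    """Merge key-sorted streams of (key, value) records.

    The heap holds at most one pending record per stream, so the
    merge itself needs O(num_streams) objects in memory.
    """
    iters = [iter(s) for s in streams]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            # idx breaks ties between equal keys, so values are never compared
            heapq.heappush(heap, (first[0], idx, first[1]))
    while heap:
        key, idx, value = heapq.heappop(heap)
        yield key, value
        # Refill from the stream that just produced a record
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], idx, nxt[1]))

# Hypothetical spilled runs, each already sorted by key
runs = [[(1, "a"), (4, "d")], [(2, "b"), (3, "c")], []]
merged = list(merge_sorted_streams(runs))
```

Anything a combine function accumulates per key (for example a buffer of values in a groupByKey-style operation) comes on top of this heap, which is the concern raised elsewhere in the thread.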

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15310561 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,649 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15310488 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,649 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15288486 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,649 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread colorant
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15277549 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,649 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread colorant
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15277089 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,649 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread colorant
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15276601 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,649 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-23 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15274240 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,649 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15271480 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49833579 I had pulled about 20 mins after I mailed you ... I have elaborated on why this occurs inline in the code - we can ignore it for now though, since it happens even in '

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49826834 @mridulm which version of the code was that with? Right now line 526 of ExternalSorter is not calling readObject, so it's hard to debug. There might've been some fixes to

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49811607 QA results for PR 1499: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.c

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15259190 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mridulm
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15259118 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49804353 We saw a bunch of EOF Exceptions from SpillReader. java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.j

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49802023 QA results for PR 1499: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.c

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49800606 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16990/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49788955 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16983/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49715947 QA results for PR 1499: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.c

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49707121 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16948/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49706936 @aarondav pushed an update to the grow code as well, which will now use estimateSize. I implemented the same thing in EAOM. @colorant fixed those, thanks.

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread colorant
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15209013 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread colorant
Github user colorant commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15208983 --- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49669437 Actually, I missed one comment, the one on the condition for spilling; I'll have to fix that in both ExternalSorter and EAOM.

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49669025 I've now updated it with, I believe, all of @aarondav's fixes and an optimization he suggested for in-memory-only data (don't bother sorting by key or merge-sorting).

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15196170 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15195282 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15194652 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15193928 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15192894 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15192915 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15157645 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-21 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15157617 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154547 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154525 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154509 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154485 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154478 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154451 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154433 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154394 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154387 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154344 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154289 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154239 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154224 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154203 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154182 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154163 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154080 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15154006 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,573 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49568383 QA results for PR 1499: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): trait SizeTrackingCollectio

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49567523 QA results for PR 1499: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): trait SizeTrackingCollectio

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49565532 I've now updated this to support partial aggregation across spilled files and even if we don't have an Ordering, using hash code comparison similar to ExternalAppendOnlyMa
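
One way to picture partial aggregation across spilled files without an Ordering: sort every run by the key's hash code, merge the runs by hash, and inside each group of equal hashes combine only the records whose keys are actually equal, since different keys can collide. The sketch below is in Python rather than Scala and is an assumption about the general technique, not the PR's code; `merge_with_aggregation`, `combine`, and `key_hash` are hypothetical names.

```python
import heapq
from itertools import groupby

def merge_with_aggregation(runs, combine, key_hash=hash):
    """Merge runs of (key, value) records, each sorted by key_hash(key),
    combining the values of equal keys across runs."""
    by_hash = lambda kv: key_hash(kv[0])
    merged = heapq.merge(*runs, key=by_hash)
    for _, group in groupby(merged, key=by_hash):
        # Keys sharing a hash are not necessarily equal, so compare
        # them explicitly; groups are small when collisions are rare.
        combined = []
        for k, v in group:
            for entry in combined:
                if entry[0] == k:
                    entry[1] = combine(entry[1], v)
                    break
            else:
                combined.append([k, v])
        for k, v in combined:
            yield k, v

# Hypothetical runs; len() stands in for a hash with many collisions
runs = [[("a", 1), ("b", 2), ("cc", 3)],
        [("b", 10), ("cc", 20), ("dd", 5)]]
result = list(merge_with_aggregation(runs, lambda x, y: x + y, key_hash=len))
```

Only the records of one hash group are held at a time, so memory stays bounded by the largest collision group plus one pending record per run.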

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49565536 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16886/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15153152 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15153139 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Found

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49564761 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16885/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49564210 QA results for PR 1499: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): trait SizeTrackingCollectio

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49561754 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16882/consoleFull

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15152282 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15152231 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15152216 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread aarondav
Github user aarondav commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15152201 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Fou

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49558215 QA results for PR 1499: - This patch PASSES unit tests. - This patch merges cleanly - This patch adds the following public classes (experimental): trait SizeTrackingCollectio

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49555343 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16878/consoleFull --- If

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49541476 QA results for PR 1499: - This patch FAILED unit tests. - This patch merges cleanly - This patch adds the following public classes (experimental): trait SizeTrackingCollectio

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49541073 QA results for PR 1499: - This patch FAILED unit tests. - This patch merges cleanly - This patch adds the following public classes (experimental): trait SizeTrackingCollectio

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15148086 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Foundat

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15148079 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Foundat

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49539656 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16865/consoleFull --- If

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15147904 --- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTrackingBuffer.scala --- @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49539437 QA tests have started for PR 1499. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16864/consoleFull --- If

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread mateiz
GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/1499 (WIP) SPARK-2045 Sort-based shuffle This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter cl

[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-20 Thread mateiz
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1499#discussion_r15147889 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala --- @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Found