[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-10-07 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-58259774 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21397/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-58109293 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21348/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-58115490 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21348/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-58115496 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56780803 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20781/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56780634 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/151/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56780807 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56770877 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20779/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56773667 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56773664 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20779/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56776628 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/151/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56776706 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20781/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56002944 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20530/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56004283 @JoshRosen @mateiz I have addressed all your comments. The IResulterIterator can now be iterated multiple times and can also be pickled. --- If your project is set up for

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56009374 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20530/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r17678700 --- Diff: python/pyspark/rdd.py --- @@ -1562,21 +1560,34 @@ def createZero(): return self.combineByKey(lambda v: func(createZero(), v),

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55928306 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/125/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55928984 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20470/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r17680613 --- Diff: python/pyspark/rdd.py --- @@ -1562,21 +1560,34 @@ def createZero(): return self.combineByKey(lambda v: func(createZero(), v),

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r17681171 --- Diff: python/pyspark/rdd.py --- @@ -1588,8 +1599,27 @@ def mergeCombiners(a, b): a.extend(b) return a -

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55938918 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/125/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55939676 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20470/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55958078 Summarizing some of our in-person discussion (@davies, let me know if I've made any mistakes here!): `GroupByKey` and `SameKey` work together to address the

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55958996 This looks like a good patch. The code here is fairly complicated and has some complex control flow, although after discussion I believe that it works correctly. It

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55959414 There's a bit of code duplication between ExternalGroupBy and ExternalMerger, but maybe this is unavoidable. It would be nice to add a short comment to

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55978000 Can you guys also add some tests that do rdd.groupByKey().filter().map(), and skip some of the groups? As well as tests that iterate over the values in a SameKey object

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55978058 Also, can you see what happens when you do rdd.groupByKey().cache()? Can we serialize and deserialize these objects back to Scala-land? It's okay if we can't cache overly

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55982725 @mateiz In this patch, the values in SameKey can only be iterated once; I will fix this later.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55984553 We can't merge this patch until that's done then, because that would be a regression. In general we try to keep even master free of regressions because quite a few people

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-6598 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20328/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55560776 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20328/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55519113 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20305/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-2290 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/106/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-4117 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/106/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-29 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-53844028 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-53844406 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19461/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-53848815 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19461/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-27 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-53648492 @mateiz @JoshRosen I think this PR is ready for review. It helped a user run groupByKey() over a 120G dataset whose hottest key has more than 80 million values.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-26 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-53515747 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19262/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-26 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-53522590 **Tests timed out** after a configured wait of `120m`.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52820243 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18970/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52828711 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18970/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52859823 Jenkins, test this please.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52859847 cc @mateiz @JoshRosen

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52860133 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19014/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52860780 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19015/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52864579 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19014/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-20 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52865158 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19015/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52599265 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18823/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52599524 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18815/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52603829 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18823/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52712146 cc @JoshRosen @mateiz

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52717422 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18897/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52717442 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18897/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52717915 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18899/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52717927 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18899/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52721511 Jenkins, retest this please

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52721772 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18908/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52725175 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18908/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52386192 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18667/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52386913 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18667/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-16 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52393277 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52393443 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18674/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52394905 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18674/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1977 [SPARK-3074] [PySpark] support groupByKey() with single huge key This patch changes groupByKey() to use an external sort-based approach, so it can support a single huge key. You can merge this pull
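
For readers unfamiliar with the idea, here is a minimal sketch of sort-based grouping with spilling, in plain Python. It is not the code from this pull request: the function names (`external_group_by_key`, `_spill`, `_read_run`) and the `chunk_size` parameter are hypothetical, and the spill format (pickled pairs in a temporary file) is an assumption made only for illustration. The point it shows is that once the pairs are sorted by key, the values of even a single huge key can be streamed rather than held in memory all at once.

```python
import heapq
import itertools
import operator
import pickle
import tempfile

_key_of = operator.itemgetter(0)


def _spill(sorted_chunk):
    """Write one sorted run of (key, value) pairs to a temporary file."""
    f = tempfile.TemporaryFile()
    for pair in sorted_chunk:
        pickle.dump(pair, f)
    f.seek(0)
    return f


def _read_run(f):
    """Stream the pairs of a spilled run back, one at a time."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_group_by_key(pairs, chunk_size=1000):
    """Yield (key, values_iterator) using sorted runs plus a k-way merge.

    Hypothetical helper: at most `chunk_size` pairs are buffered in memory
    before a sorted run is spilled to disk.
    """
    runs, chunk = [], []
    for pair in pairs:
        chunk.append(pair)
        if len(chunk) >= chunk_size:
            chunk.sort(key=_key_of)
            runs.append(_spill(chunk))
            chunk = []
    chunk.sort(key=_key_of)  # the last partial chunk stays in memory
    merged = heapq.merge(chunk, *[_read_run(f) for f in runs], key=_key_of)
    for key, group in itertools.groupby(merged, key=_key_of):
        # Values are streamed lazily; they are never collected into a list here.
        yield key, (value for _, value in group)
```

In this sketch each values iterator is single-pass and must be consumed before moving on to the next key, for example `for key, values in external_group_by_key(pairs): total = sum(values)`.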

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52376616 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18642/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-5234 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18646/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52378307 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18642/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52379142 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18646/consoleFull) for PR 1977 at commit

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52381086 Does / will the same functionality exist in Scala/Java?

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread andrewor14
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52383067 I believe this is one of those few things in Spark where Python is ahead of Scala

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-08-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-52384600 @sryza There are similar things in Scala, but we cannot compare Python objects in Scala, so it cannot use the Scala groupByKey() directly. All the aggregation
