Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-58259774
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21397/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-58109293
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21348/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-58115490
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21348/consoleFull)
for PR 1977 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-58115496
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56780803
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20781/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56780634
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/151/consoleFull)
for PR 1977 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56780807
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56770877
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20779/consoleFull)
for PR 1977 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56773667
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56773664
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20779/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56776628
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/151/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56776706
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20781/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56002944
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20530/consoleFull)
for PR 1977 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56004283
@JoshRosen @mateiz I have addressed all your comments. The
IResulterIterator can now be iterated multiple times and can also be pickled.
---
If your project is set up for
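As an illustration of the two properties mentioned above (repeatable iteration and picklability), here is a minimal sketch. The class name `ReiterableValues` and its factory-based design are hypothetical and are not taken from the patch; the sketch simply assumes the values can be regenerated on demand and are materialized into a plain list when pickled.

```python
import pickle


class ReiterableValues:
    """Hypothetical wrapper (not the PR's class): values that can be iterated repeatedly."""

    def __init__(self, make_iter):
        # make_iter is a zero-argument callable returning a fresh iterator each time
        self._make_iter = make_iter

    def __iter__(self):
        # A new iterator per call, so the values can be walked more than once
        return self._make_iter()

    def __reduce__(self):
        # Assumption for this sketch: pickle as a plain list of the materialized values
        return (list, (list(self),))


values = ReiterableValues(lambda: iter(range(3)))
first_pass = list(values)
second_pass = list(values)                      # iterating again yields the same values
restored = pickle.loads(pickle.dumps(values))   # comes back as an ordinary list
```

Materializing on pickle keeps the sketch short, but it gives up the memory savings for very large groups; a real implementation would need a different strategy there.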
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56009374
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20530/consoleFull)
for PR 1977 at commit
Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/1977#discussion_r17678700
--- Diff: python/pyspark/rdd.py ---
@@ -1562,21 +1560,34 @@ def createZero():
return self.combineByKey(lambda v: func(createZero(), v),
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55928306
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/125/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55928984
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20470/consoleFull)
for PR 1977 at commit
Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/1977#discussion_r17680613
--- Diff: python/pyspark/rdd.py ---
@@ -1562,21 +1560,34 @@ def createZero():
return self.combineByKey(lambda v: func(createZero(), v),
Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/1977#discussion_r17681171
--- Diff: python/pyspark/rdd.py ---
@@ -1588,8 +1599,27 @@ def mergeCombiners(a, b):
a.extend(b)
return a
-
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55938918
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/125/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55939676
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20470/consoleFull)
for PR 1977 at commit
Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55958078
Summarizing some of our in-person discussion (@davies, let me know if I've
made any mistakes here!):
`GroupByKey` and `SameKey` work together to address the
Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55958996
This looks like a good patch. The code here is fairly complicated and has
some complex control flow, although after discussion I believe that it works
correctly. It
Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55959414
There's a bit of code duplication between ExternalGroupBy and
ExternalMerger, but maybe this is unavoidable. It would be nice to add a short
comment to
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55978000
Can you guys also add some tests that do rdd.groupByKey().filter().map(),
and skip some of the groups? As well as tests that iterate over the values in a
SameKey object
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55978058
Also, can you see what happens when you do rdd.groupByKey().cache()? Can we
serialize and deserialize these objects back to Scala-land? It's okay if we
can't cache overly
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55982725
@mateiz In this patch, the values in SameKey can only be iterated once. I
will fix this later.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55984553
We can't merge this patch until that's done then, because that would be a
regression. In general we try to keep even master free of regressions because
quite a few people
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-6598
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20328/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55560776
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20328/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55519113
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20305/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-2290
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/106/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-4117
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/106/consoleFull)
for PR 1977 at commit
Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-53844028
Jenkins, retest this please.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-53844406
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19461/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-53848815
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19461/consoleFull)
for PR 1977 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-53648492
@mateiz @JoshRosen I think this PR is ready for review. It helped a user
do groupByKey() over a 120G dataset whose hottest key has more than 80
million values.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-53515747
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19262/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-53522590
**Tests timed out** after a configured wait of `120m`.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52820243
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18970/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52828711
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18970/consoleFull)
for PR 1977 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52859823
Jenkins, test this please.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52859847
cc @mateiz @JoshRosen
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52860133
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19014/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52860780
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19015/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52864579
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19014/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52865158
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19015/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52599265
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18823/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52599524
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18815/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52603829
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18823/consoleFull)
for PR 1977 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52712146
cc @JoshRosen @mateiz
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52717422
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18897/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52717442
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18897/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52717915
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18899/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52717927
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18899/consoleFull)
for PR 1977 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52721511
Jenkins, retest this please
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52721772
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18908/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52725175
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18908/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52386192
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18667/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52386913
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18667/consoleFull)
for PR 1977 at commit
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52393277
Jenkins, retest this please.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52393443
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18674/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52394905
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18674/consoleFull)
for PR 1977 at commit
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/1977
[SPARK-3074] [PySpark] support groupByKey() with single huge key
This patch changes groupByKey() to use an external sort-based approach, so it
can support a single huge key.
You can merge this pull
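To illustrate the general idea of a sort-based group-by that does not need to hold all values of one key in memory, here is a minimal, self-contained sketch in plain Python. It is not the patch's implementation: the function names, the `chunk_size` parameter, and the use of `pickle` spill files are assumptions for illustration, and it requires keys that can be ordered.

```python
import heapq
import itertools
import os
import pickle
import tempfile
from operator import itemgetter


def _spill(sorted_chunk, tmp_dir):
    """Write one sorted chunk of (key, value) pairs to a temp file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "wb") as f:
        for pair in sorted_chunk:
            pickle.dump(pair, f)
    return path


def _read(path):
    """Lazily read pickled (key, value) pairs back from a spill file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def external_group_by_key(pairs, chunk_size, tmp_dir):
    """Yield (key, values_iterator) in key order, holding at most chunk_size pairs in memory."""
    it = iter(pairs)
    runs = []
    while True:
        # Sort a bounded chunk in memory, then spill it to disk as a sorted run
        chunk = sorted(itertools.islice(it, chunk_size), key=itemgetter(0))
        if not chunk:
            break
        runs.append(_spill(chunk, tmp_dir))
    # Merge the sorted runs lazily; equal keys become adjacent in the merged stream
    merged = heapq.merge(*(_read(p) for p in runs), key=itemgetter(0))
    for key, group in itertools.groupby(merged, key=itemgetter(0)):
        # Values are streamed; each group must be consumed before advancing to the next
        yield key, (v for _, v in group)


with tempfile.TemporaryDirectory() as tmp:
    data = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5), ("b", 6)]
    grouped = [(k, list(vs)) for k, vs in external_group_by_key(data, 2, tmp)]
```

Because the values of each group are streamed from the merged runs, a single key with a very large number of values never has to fit in memory at once. The trade-off in this sketch is that each group's values can only be walked once, in order.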
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52376616
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18642/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-5234
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18646/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52378307
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18642/consoleFull)
for PR 1977 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52379142
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18646/consoleFull)
for PR 1977 at commit
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52381086
Does / will the same functionality exist in Scala/Java?
Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52383067
I believe this is one of those few things in Spark where Python is ahead of
Scala
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-52384600
@sryza There are similar things in Scala, but we cannot compare Python
objects in Scala, so PySpark cannot use the groupByKey() in Scala directly. All the
aggregation