Github user rxin commented on the issue:
https://github.com/apache/spark/pull/22010
If this is not yet in 2.4 it shouldn't be merged now.
On Wed, Oct 10, 2018 at 10:57 AM Holden Karau wrote:
> Open question: is this suitable for branch-2.4 since it predates the
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
Open question: is this suitable for branch-2.4 since it predates the branch
cut or not? (I know we've gone back and forth on how we do that).
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22010
thanks, merging to master!
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96680/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96680 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96680/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/9
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22010
LGTM
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3529/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96680 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96680/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/95
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22010
retest this please
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96669 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96669/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/9
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96669/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3519/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96669 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96669/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/95
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22010
retest this please
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96652/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96652 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96652/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/9
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3505/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96652 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96652/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/95
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22010
retest this please
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96641/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96641 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96641/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/9
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3496/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96641 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96641/testReport)**
for PR 22010 at commit
[`95357cf`](https://github.com/apache/spark/commit/95
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
I'll leave this until Friday morning (Pacific) in case anyone has
last-minute comments. cc @rxin / @HyukjinKwon / @mgaido91
---
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
So by running `sc.parallelize(1.to(1000)).map(x => (x % 10,
x)).sortByKey().distinct().count()` in 2.3.0 and my PR we can see the
difference:
![240_proposed_distinct_screenshot from 2018-09-26
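
A minimal sketch of how this comparison might be reproduced locally, assuming a Spark 2.x dependency on the classpath and a `local[2]` master. The names `DistinctAfterSort` and `run` are hypothetical and not part of the PR. It runs the same pipeline and prints the partitioner that `sortByKey()` leaves on the RDD; whether `distinct()` then avoids a shuffle still has to be confirmed in the web UI for the Spark build under test.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper, not from the PR: runs the pipeline quoted in the comment above.
object DistinctAfterSort {
  def run(sc: SparkContext): Long = {
    val sorted = sc.parallelize(1.to(1000)).map(x => (x % 10, x)).sortByKey()
    // sortByKey() shuffles with a RangePartitioner, so a partitioner is defined here.
    println(s"partitioner after sortByKey: ${sorted.partitioner}")
    sorted.distinct().count()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads is enough for this check.
    val conf = new SparkConf().setAppName("distinct-after-sort").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(s"distinct count: ${run(sc)}")
    finally sc.stop()
  }
}
```

Every `(x % 10, x)` pair is already unique, so the count is the same with or without the change; only the number of shuffle stages shown in the UI should differ.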
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96583/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96583 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96583/testReport)**
for PR 22010 at commit
[`849f67b`](https://github.com/apache/spark/commit/8
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
Did another quick micro benchmark on a small cluster:
```scala
import org.apache.spark.util.collection.ExternalAppendOnlyMap
def removeDuplicatesInPartition(partition: Iterator[(
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3454/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #96583 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96583/testReport)**
for PR 22010 at commit
[`849f67b`](https://github.com/apache/spark/commit/84
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
@cloud-fan yeah that's totally an option. Since @rxin asked for it to use
`reduceByKey` I went with that approach, but I'd be happy to use the
`ExternalAppendOnlyMap` if that's ok with folks.
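
The two options mentioned here can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: it assumes Spark 2.x on the classpath, the object and method names (`DistinctSketches`, `distinctViaReduceByKey`, `distinctInPartition`) are hypothetical, and `distinctInPartition` must run inside a Spark task because `ExternalAppendOnlyMap` picks up its serializer and block manager from the running environment.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.collection.ExternalAppendOnlyMap
import scala.reflect.ClassTag

// Hypothetical sketches of the two approaches under discussion.
object DistinctSketches {
  // Shuffle-based: key each element by itself and keep a single value per key.
  def distinctViaReduceByKey[T: ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] =
    rdd.map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)

  // Per-partition: deduplicate with a map that can spill to disk, ignoring values.
  def distinctInPartition[T](partition: Iterator[T]): Iterator[T] = {
    val map = new ExternalAppendOnlyMap[T, Null, Null](
      _ => null,      // createCombiner
      (a, _) => a,    // mergeValue
      (a, _) => a)    // mergeCombiners
    map.insertAll(partition.map(x => (x, null)))
    map.iterator.map(_._1)
  }
}
```

The per-partition variant is only equivalent to a global distinct when equal elements are already guaranteed to sit in the same partition, for example via `rdd.mapPartitions(it => DistinctSketches.distinctInPartition(it), preservesPartitioning = true)` on an RDD with a known partitioner.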
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22010
I think this works, can we post some Spark web UI screenshots to confirm
the shuffle is indeed eliminated?
BTW one idea to simplify the implementation:
```
def distinct(numPartitio
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
@rxin So that RDD could not exist with a known partitioner (regardless of
range-based or hash-based, the partitioner must be deterministic, so two elements
with the same key must go to the same partit
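
The argument above can be illustrated without Spark. The following is a minimal sketch in plain Scala, with hypothetical names (`PartitionedDistinct`, `partitionOf`, `partitionByKey`, `distinctPerPartition`) and a simple modulo function standing in for a deterministic partitioner. Equal elements have equal keys, so they land in the same partition, and deduplicating each partition on its own yields the same elements as a global distinct.

```scala
// Hypothetical stand-in for a deterministic partitioner: the partition index
// depends only on the key, as with hash- or range-based partitioning.
object PartitionedDistinct {
  def partitionOf(key: Int, numPartitions: Int): Int =
    ((key % numPartitions) + numPartitions) % numPartitions

  // Split key-value elements into partitions by key.
  def partitionByKey(data: Seq[(Int, Int)], numPartitions: Int): Vector[Seq[(Int, Int)]] = {
    val grouped = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Deduplicate each partition independently, with no data moved between partitions.
  def distinctPerPartition(parts: Vector[Seq[(Int, Int)]]): Vector[Seq[(Int, Int)]] =
    parts.map(_.distinct)

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 10), (2, 20), (1, 10), (3, 30), (2, 20), (1, 11))
    val local = distinctPerPartition(partitionByKey(data, 2)).flatten
    // Same elements as a global distinct, possibly in a different order.
    println(local.toSet == data.distinct.toSet && local.size == data.distinct.size)
  }
}
```

The sketch only covers elements that are key-value pairs placed by key; an RDD whose partition count merely happens to match `numPartitions` carries no such guarantee.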
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/22010
Actually @holdenk is this change even correct? RDD.distinct is not key
based. It is based on the value of the elements in RDD. Even if `numPartitions
== partitions.length`, it doesn't mean the RDD is h
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
Hey @rxin & @cloud-fan I'd really appreciate your input on the tricks I did
to keep the partitioner information present -- is this the right approach?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95768/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #95768 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95768/testReport)**
for PR 22010 at commit
[`4c89653`](https://github.com/apache/spark/commit/4
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95767/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #95767 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95767/testReport)**
for PR 22010 at commit
[`7ed7589`](https://github.com/apache/spark/commit/7
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2911/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2910/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #95768 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95768/testReport)**
for PR 22010 at commit
[`4c89653`](https://github.com/apache/spark/commit/4c
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
Updated to use reduceByKey. I'd really appreciate feedback on whether adding the
param to `MapPartitionsRDD` was the way to go or whether I should subclass it
instead.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #95767 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95767/testReport)**
for PR 22010 at commit
[`7ed7589`](https://github.com/apache/spark/commit/7e
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22010
thanks for checking @rxin @cloud-fan
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22010
I am sorry guys. I rushed to take a look.
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22010
This saves a shuffle, but the algorithm becomes different. Previously
we used the shuffle aggregator, which stores data in an `ExternalAppendOnlyMap`.
Now we use a Scala set, which may OOM.
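
To make the concern concrete, here is a minimal sketch of set-based per-partition deduplication in plain Scala. It is an illustration of the general pattern, not the code from the PR, and the names `InMemoryDistinct` and `removeDuplicates` are hypothetical.

```scala
import scala.collection.mutable

object InMemoryDistinct {
  // Hypothetical illustration: deduplicate one partition with a plain hash set.
  // Every distinct element stays on the heap until the iterator is exhausted,
  // and nothing here can spill to disk.
  def removeDuplicates[T](partition: Iterator[T]): Iterator[T] = {
    val seen = mutable.HashSet.empty[T]
    // HashSet.add returns true only the first time an element is inserted.
    partition.filter(seen.add)
  }
}
```

Memory use grows with the number of distinct elements in the partition, which is why a spillable structure is the safer choice when a partition may not fit in memory.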
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/22010
Thanks for pinging. Please don't merge this until you've addressed the OOM
issue. The aggregators were created to handle incoming data larger than the size of
memory. We should never use a Scala or Java ha
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
I think this is pretty clearly a win, but since it's been a while since I
did anything in core I'll leave this until Friday morning (Pacific) in case any
of the committers who've been working there h
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
I did a quick micro-benchmark on this and got:
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> import scala.collection.{mutable, Map}
> def removeDuplicatesIn
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
@HyukjinKwon sure, I'll do a micro benchmark sometime this week.
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22010
Logically this looks right, but would you mind if I ask for a simple benchmark,
@holdenk, just to make everything clear?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94634/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94634 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94634/testReport)**
for PR 22010 at commit
[`5fd3659`](https://github.com/apache/spark/commit/5
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94634 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94634/testReport)**
for PR 22010 at commit
[`5fd3659`](https://github.com/apache/spark/commit/5f
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2091/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/22010
The test failure is a streaming timeout, likely unrelated. Jenkins retest this
please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94607/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94607 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94607/testReport)**
for PR 22010 at commit
[`5fd3659`](https://github.com/apache/spark/commit/5
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94607 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94607/testReport)**
for PR 22010 at commit
[`5fd3659`](https://github.com/apache/spark/commit/5f
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2076/
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22010
retest this please
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94577/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94577 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94577/testReport)**
for PR 22010 at commit
[`5fd3659`](https://github.com/apache/spark/commit/5
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94577 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94577/testReport)**
for PR 22010 at commit
[`5fd3659`](https://github.com/apache/spark/commit/5f
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2053/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94303/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94303 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94303/testReport)**
for PR 22010 at commit
[`a7fbc74`](https://github.com/apache/spark/commit/a
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22010
**[Test build #94303 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94303/testReport)**
for PR 22010 at commit
[`a7fbc74`](https://github.com/apache/spark/commit/a7
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1861/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22010
Merged build finished. Test PASSed.
---