Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
I'd like to close this for now and wait for the necessary changes to statistics.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.a
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
> In theory, this should be done in a cost-based style. Changing how union
> combines data will reduce parallelism.
> For example, suppose we union 2 tables that each have 5 partitions. Without thi
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21498
In theory, this should be done in a cost-based style. Changing how union
combines data will reduce parallelism.
For example, suppose we union 2 tables that each have 5 partitions. Without
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21498
@viirya sorry, I somehow lost your updated benchmark. Yes, it makes sense.
In the case without any shuffle needed after the union we have about a 2%
performance regression. I am not sure about the
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
@mgaido91 WDYT? Does the benchmark make sense to you?
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
Benchmarking on a Spark cluster with 5 nodes on EC2 too.
```scala
def benchmark(func: () => Unit): Unit = {
  val t0 = System.nanoTime()
  func()
  val t1 = System.nanoTi
```
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
@mgaido91 Ok. I will try to have another one.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21498
Thanks for your benchmark @viirya. The performance improvement is sensible.
And there seems to be no performance regression in the other case. Can we have a similar
benchmark also with records with more complex
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
I set up a Spark cluster with 5 nodes on EC2.
```scala
def benchmark(func: () => Unit): Unit = {
  val t0 = System.nanoTime()
  func()
  val t1 = System.nanoTime()
```
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
> In aggregation we are replacing a needed shuffle with gathering only the
needed rows from the other partitions.
I don't know what this means actually. If we decided we don't need a
shuffle
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21498
> Because they have the same partitioning, I suppose, for example, that the first
> partitions of all RDDs are located in the same place?
I really don't think so.
In aggregation we are replac
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
When the condition is satisfied and we know the children of Union have the same
partitioning, this makes the first partition of the union result include the
first partitions of the children RDDs, and 2nd, 3rd par
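The layout described in this comment can be illustrated with a minimal sketch that uses plain Scala collections rather than Spark itself. Each `Vector[Vector[Int]]` stands in for an RDD's ordered list of partitions, and `concatUnion` and `partitionWiseUnion` are hypothetical names chosen for illustration; this is not the PR's implementation.

```scala
object UnionLayoutSketch {
  // Stand-in for an RDD: an ordered list of partitions, each holding rows.
  type Partitions = Vector[Vector[Int]]

  // Concatenating union: the children's partitions are placed side by side,
  // so the result has as many partitions as all children combined.
  def concatUnion(children: Seq[Partitions]): Partitions =
    children.flatten.toVector

  // Partition-wise union (assumes every child has the same partition count):
  // result partition i holds partition i of every child.
  def partitionWiseUnion(children: Seq[Partitions]): Partitions = {
    require(children.nonEmpty, "need at least one child")
    val n = children.head.length
    require(children.forall(_.length == n), "children must have equal partition counts")
    Vector.tabulate(n)(i => children.flatMap(child => child(i)).toVector)
  }
}
```

With two children of 5 partitions each, `concatUnion` yields 10 partitions while `partitionWiseUnion` yields 5, which is the parallelism trade-off raised elsewhere in this thread.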
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21498
@viirya I may be wrong, but I am not sure about the performance improvement
brought by this. The goal here is to avoid a shuffle after the `union` operator
(when it is followed by operators requiri
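To make the stated goal concrete, here is a minimal sketch in plain Scala collections (not Spark APIs) of why a per-key aggregation could skip a shuffle after a partition-wise union. It assumes both children were hash-partitioned on the grouping key with the same partition count; every name below is hypothetical.

```scala
object CoPartitionedAggSketch {
  type Row = (String, Int) // (key, value)
  type Partitions = Vector[Vector[Row]]

  // Hypothetical hash partitioner: the same key always maps to the same index.
  def partitionIndex(key: String, numPartitions: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numPartitions)

  // Distribute rows into numPartitions buckets by key hash.
  def hashPartition(rows: Seq[Row], numPartitions: Int): Partitions = {
    val grouped = rows.groupBy { case (k, _) => partitionIndex(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[Row]).toVector)
  }

  // Partition-wise union of two children that share the same partitioning.
  def partitionWiseUnion(left: Partitions, right: Partitions): Partitions = {
    require(left.length == right.length, "children must have equal partition counts")
    left.zip(right).map { case (l, r) => l ++ r }
  }

  // Sum per key inside each partition only; no rows move between partitions.
  // This is only correct because every key lives in exactly one partition.
  def localSumByKey(parts: Partitions): Map[String, Int] =
    parts.flatMap { p =>
      p.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2).sum }
    }.toMap
}
```

Because both inputs place a given key at the same partition index, the local sums already equal the global sums; if the children were partitioned differently, the same key could appear in several partitions and a shuffle would still be needed.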
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
@cloud-fan The WIP tag has been removed and this can be reviewed now. Please take a look
when you are available, as I suppose you'll be busy this week. Thanks.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91497/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21498
**[Test build #91497 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91497/testReport)**
for PR 21498 at commit
[`69a7066`](https://github.com/apache/spark/commit/6
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91495/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21498
**[Test build #91495 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91495/testReport)**
for PR 21498 at commit
[`0dedf44`](https://github.com/apache/spark/commit/0
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91496/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Merged build finished. Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21498
**[Test build #91496 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91496/testReport)**
for PR 21498 at commit
[`6f487c9`](https://github.com/apache/spark/commit/6
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91493/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21498
**[Test build #91493 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91493/testReport)**
for PR 21498 at commit
[`b058f89`](https://github.com/apache/spark/commit/b
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3821/
Tes
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21498
Merged build finished. Test PASSed.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21498
Tests are added. cc @kiszk @mgaido91
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21498
**[Test build #91497 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91497/testReport)**
for PR 21498 at commit
[`69a7066`](https://github.com/apache/spark/commit/69