[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-07-16 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 I'd like to close this for now and wait for the necessary changes to statistics.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-12 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 > In theory, this should be done in a cost-based style. Changing how union combines data will reduce the parallelism. > For example, if we union 2 tables that each have 5 partitions. Without

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-12 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21498 In theory, this should be done in a cost-based style. Changing how union combines data will reduce the parallelism. For example, if we union 2 tables that each have 5 partitions.
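The parallelism concern can be made concrete with a small model. The following is a minimal sketch in plain Scala collections, not Spark code: `UnionParallelism`, `concatUnion`, `zipUnion`, and `makeTable` are hypothetical names introduced only for illustration. It assumes that a default union concatenates the partitions of its inputs, while a partition-wise union merges the i-th partitions of each input.

```scala
object UnionParallelism {
  // A "table" is modelled as a sequence of partitions; each partition is a sequence of rows.
  type Table = Seq[Seq[Int]]

  // Default-style union: the partitions of both inputs are concatenated, so the counts add up.
  def concatUnion(a: Table, b: Table): Table = a ++ b

  // Partition-wise union: the i-th partitions are merged, so the partition count is unchanged.
  def zipUnion(a: Table, b: Table): Table = {
    require(a.length == b.length, "inputs must have the same number of partitions")
    a.zip(b).map { case (pa, pb) => pa ++ pb }
  }

  // Builds a table with the given number of partitions and rows per partition.
  def makeTable(numPartitions: Int, rowsPerPartition: Int): Table =
    Seq.tabulate(numPartitions) { p =>
      Seq.tabulate(rowsPerPartition)(r => p * rowsPerPartition + r)
    }
}
```

With two 5-partition tables, `concatUnion` yields 10 partitions (10 tasks can run in parallel) while `zipUnion` yields 5 partitions holding the same rows, which is the trade-off the comment points at.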

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-12 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21498 @viirya sorry, I somehow lost your updated benchmark. Yes, it makes sense. In the case without any shuffle needed after the union we have about a 2% performance regression. I am not sure about

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-12 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 @mgaido91 WDYT? Does the benchmark make sense to you?

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 Benchmarking on a Spark cluster with 5 nodes on EC2 too. ```scala def benchmark(func: () => Unit): Unit = { val t0 = System.nanoTime() func() val t1 =

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 @mgaido91 Ok. I will try to have another one.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-07 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21498 Thanks for your benchmark @viirya. The performance improvement is sensible. And there seems to be no performance regression in the other case. Can we have a similar benchmark also with records with more

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 I set up a Spark cluster with 5 nodes on EC2. ```scala def benchmark(func: () => Unit): Unit = { val t0 = System.nanoTime() func() val t1 = System.nanoTime()
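The benchmark snippet above is cut off in this archive. As a rough illustration of the same `System.nanoTime()` timing pattern, here is a minimal sketch; `Timing`, `timeNanos`, and `meanMillis` are hypothetical names, and returning the elapsed time and averaging over several runs are assumptions, not what the original helper necessarily did.

```scala
object Timing {
  // Hypothetical helper: runs func once and returns the elapsed wall-clock time in nanoseconds.
  def timeNanos(func: () => Unit): Long = {
    val t0 = System.nanoTime()
    func()
    val t1 = System.nanoTime()
    t1 - t0
  }

  // Runs func `runs` times and returns the mean elapsed time in milliseconds.
  def meanMillis(runs: Int)(func: () => Unit): Double = {
    require(runs > 0, "runs must be positive")
    val total = (1 to runs).map(_ => timeNanos(func)).sum
    total.toDouble / runs / 1e6
  }
}
```

Averaging over several runs reduces the effect of JIT warm-up and scheduling noise on a single measurement; for a Spark job, `func` would wrap an action that forces the query to execute.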

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 > In aggregation we are replacing a needed shuffle with gathering only the needed rows from the other partitions. I don't know what this means actually. If we decided we don't need a

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21498 > Because they have the same partitioning, for example, I suppose that the first partitions of all RDDs are located at the same place? I really don't think so. In aggregation we are

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 When the condition is satisfied and we know the children of Union have the same partitioning, this makes the first partition of the union result include the first partitions of the children RDDs, and 2nd, 3rd
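The partition-wise union described here can be sketched without Spark. The model below is an illustration under stated assumptions: `PartitionAwareUnion`, `partitionOf`, `hashPartition`, `unionByPartition`, and `partitionsPerKey` are hypothetical names, and the modulo-based `partitionOf` merely stands in for a real hash partitioner. It shows why, when all children share a partitioning, merging the i-th partitions keeps each key in a single output partition.

```scala
object PartitionAwareUnion {
  type Row = (Int, String)               // (key, value)
  type Partitioned = Vector[Vector[Row]] // one inner Vector per partition

  // Hypothetical stand-in for a hash partitioner: non-negative modulo of the key.
  def partitionOf(key: Int, numPartitions: Int): Int =
    ((key % numPartitions) + numPartitions) % numPartitions

  // Distributes rows over numPartitions partitions according to partitionOf.
  def hashPartition(rows: Seq[Row], numPartitions: Int): Partitioned = {
    val grouped = rows.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[Row]).toVector)
  }

  // The i-th output partition is the concatenation of the i-th partitions of all children.
  def unionByPartition(children: Seq[Partitioned]): Partitioned = {
    require(children.nonEmpty, "need at least one child")
    val n = children.head.length
    require(children.forall(_.length == n), "children must have the same number of partitions")
    Vector.tabulate(n)(i => children.flatMap(_(i)).toVector)
  }

  // For each key, the number of distinct partitions in which it occurs.
  def partitionsPerKey(data: Partitioned): Map[Int, Int] =
    data.zipWithIndex
      .flatMap { case (part, idx) => part.map { case (k, _) => (k, idx) } }
      .distinct
      .groupBy(_._1)
      .map { case (k, pairs) => k -> pairs.length }
}
```

If `partitionsPerKey` reports 1 for every key after `unionByPartition`, a following per-key aggregation can run within each partition; concatenating the children's partitions instead (`a ++ b`) spreads a shared key over two partitions, which is the situation that normally requires a shuffle.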

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread mgaido91
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21498 @viirya I may be wrong, but I am not sure about the performance improvement brought by this. The goal here is to avoid a shuffle after the `union` operator (when it is followed by operators

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 @cloud-fan The WIP tag has been removed and this can be reviewed now. Please take a look when you are available, as I suppose you'll be busy this week. Thanks.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Merged build finished. Test PASSed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91497/ Test PASSed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21498 **[Test build #91497 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91497/testReport)** for PR 21498 at commit

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91495/ Test PASSed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Merged build finished. Test PASSed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21498 **[Test build #91495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91495/testReport)** for PR 21498 at commit

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91496/ Test FAILed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Merged build finished. Test FAILed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21498 **[Test build #91496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91496/testReport)** for PR 21498 at commit

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Merged build finished. Test PASSed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91493/ Test PASSed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21498 **[Test build #91493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91493/testReport)** for PR 21498 at commit

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3821/

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21498 Merged build finished. Test PASSed.

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21498 Tests are added. cc @kiszk @mgaido91

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

2018-06-05 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21498 **[Test build #91497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91497/testReport)** for PR 21498 at commit