[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-686560052 ready to merge if no other problems @LuciferYang Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-680740923 retest this please
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-675801089

> > But shuffle is happened during Aggregate here, right? By splitting, the total amount of shuffled data is not changed, but split into several ones. Does it really result significant improvement?
>
> As @viirya said above, I think the same. Why can this reduce the amount of shuffle writes (and improve the performance)? In the case of `expand -> partial aggregates`, the aggregates seem to have the same **total** amount of output size.

In my view, and according to the benchmark, when the data size is larger than the execution memory, the data spills to disk, which hurts performance and increases the run time.
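The point under discussion (same total output, smaller pieces) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the optimizer rule from the PR: `expand`, `split_expand`, and `group_size` are hypothetical names, and plain Python lists stand in for Spark rows and projections.

```python
def expand(rows, projections):
    """Emit one output row per (input row, projection) pair, like an Expand node."""
    return [proj(row) for row in rows for proj in projections]


def split_expand(rows, projections, group_size):
    """Hypothetical split: run several smaller Expands over slices of the projections."""
    return [
        expand(rows, projections[i:i + group_size])
        for i in range(0, len(projections), group_size)
    ]


rows = [(1, "a"), (2, "b"), (3, "c")]
# Eight stand-in projections; each one tags the row with a grouping id.
projections = [lambda row, gid=gid: row + (gid,) for gid in range(8)]

single = expand(rows, projections)
parts = split_expand(rows, projections, group_size=2)

total_split = sum(len(part) for part in parts)
largest_part = max(len(part) for part in parts)
print(len(single), total_split, largest_part)  # 24 24 6
```

In this model the total number of expanded rows is unchanged by splitting, while the largest single Expand output shrinks. Whether that translates into less spilling in Spark depends on the execution memory available, as argued in the comment above.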
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-675515887 @cloud-fan @maropu Please take a look, thanks!
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-675322813 retest this please
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-675202769 @cloud-fan
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-674625909 Hi, can anyone take a look at this PR?
Karl-WangSK commented on pull request #29360: URL: https://github.com/apache/spark/pull/29360#issuecomment-673868646 Yes. The shuffle output is the same, because the size of the data is the same. As you can see in the benchmark, compare cube on 7 fields k1, k2, k3, k4, k5, k6, k7 (128x projections) with cube on 6 fields k1, k2, k3, k4, k5, k6 (64x projections), with grouping off: the data size doubles, but one run takes 2.4 min and the other 8.7 min, which is well over double the time. Run time is affected by data size, especially when memory is limited. The original data I created is about 20M and executor memory is 1g, so expanding it 64x or 128x has a big impact on shuffle performance.
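The 64x and 128x figures follow from cube producing one grouping set per subset of the grouping fields, i.e. 2^n sets for n fields. A minimal sketch of that arithmetic, using a hypothetical helper `cube_grouping_sets` (not a Spark API) and the two timings quoted above:

```python
from itertools import combinations


def cube_grouping_sets(fields):
    """Return every subset of `fields`; a cube aggregates over each of them."""
    return [
        subset
        for size in range(len(fields), -1, -1)
        for subset in combinations(fields, size)
    ]


six = cube_grouping_sets(["k1", "k2", "k3", "k4", "k5", "k6"])
seven = cube_grouping_sets(["k1", "k2", "k3", "k4", "k5", "k6", "k7"])

print(len(six))    # 64
print(len(seven))  # 128

# Data grows by 2x between the two cases, while the quoted timings
# (2.4 min and 8.7 min) differ by noticeably more than 2x.
data_ratio = len(seven) / len(six)
time_ratio = 8.7 / 2.4
print(data_ratio, time_ratio)
```

The time ratio comes out at roughly 3.6 against a data ratio of 2, which is the super-linear slowdown the comment attributes to limited memory.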