[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-09-03 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-686560052


   Ready to merge if there are no other problems, @LuciferYang. Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-26 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-680740923


   retest this please






[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-18 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-675801089


   > > But the shuffle happens during Aggregate here, right? By splitting, the 
total amount of shuffled data is not changed, but is split into several parts. 
Does it really result in a significant improvement?
   > 
   > As @viirya said above, I think the same. Why can this reduce the amount of 
shuffle writes (and improve the performance)? In the case of `expand -> partial 
aggregates`, the aggregates seem to have the same **total** output size.
   
   Based on the benchmark, I think that when the data size is larger than the 
execution memory, the data spills to disk, which hurts performance and 
increases the run time.






[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-18 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-675515887


   @cloud-fan @maropu  Please take a look, thanks!






[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-18 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-675322813


   retest this please






[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-17 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-675202769


   @cloud-fan 






[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-16 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-674625909


   Hi, can anyone take a look at this PR?






[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-13 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-673868646


   Yes. The shuffle output is the same, because the size of the data is the 
same. 
   As you can see in the benchmark:
   cube on 7 fields k1, k2, k3, k4, k5, k6, k7 (128x projections) and cube on 
6 fields k1, k2, k3, k4, k5, k6 (64x projections), with grouping off.
   The data size doubles, but the run times are 2.4 min versus 8.7 min, which 
is more than double. The run time is affected by data size, especially when 
memory is limited.
   The original data I created is about 20M and executor memory is 1g. When it 
expands to 64x or 128x, it has a big impact on shuffle performance.
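
As a rough illustration of where the 64x and 128x figures come from: a cube over n keys groups by every subset of those keys, so it produces 2^n grouping sets, and the Expand emits one projection per grouping set. The sketch below is plain Python, not Spark code, and `cube_grouping_sets` is a hypothetical helper written only for this example.

```python
from itertools import combinations


def cube_grouping_sets(keys):
    """Return every grouping set that a CUBE over `keys` produces.

    CUBE groups by every subset of the keys, from all of them down to
    the empty set (the grand total), so n keys give 2**n grouping sets.
    """
    return [
        subset
        for size in range(len(keys), -1, -1)
        for subset in combinations(keys, size)
    ]


six_keys = ["k1", "k2", "k3", "k4", "k5", "k6"]
seven_keys = six_keys + ["k7"]

sets_for_six = cube_grouping_sets(six_keys)
sets_for_seven = cube_grouping_sets(seven_keys)

# One projection per grouping set: 64 for six keys, 128 for seven.
print(len(sets_for_six), len(sets_for_seven))
```

Adding one more key doubles the number of projections, which is why the expanded data for the 7-field cube is twice that of the 6-field cube.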
   
   
   


