[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable

2020-07-11 Thread GitBox


maropu commented on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-657166984


   Thanks for your interest, @c21 
   
   > (1) Is there a reason why we don't cover ShuffledHashJoin as well? (In 
production, we see people also use ShuffledHashJoin a lot for joining 
bucketed tables when one side is small.)
   
   As you also said in (3), I think that's because of a concern that 
coalescing might hurt parallelism. You can see the related discussion in 
the history: https://github.com/apache/spark/pull/28123#discussion_r427073319
   As for (1) and (3), IMO it's worth digging into them for further improvements.
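
   To make the parallelism concern concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a row's bucket id is its key hash modulo the bucket count and that the larger bucket count is a multiple of the smaller one; `coalesce_buckets` is a hypothetical helper name used only for illustration.

```python
from collections import defaultdict


def coalesce_buckets(bucket_ids, num_coalesced_buckets):
    """Group original bucket ids into coalesced buckets by modulo.

    Assumption: bucket_id == hash(key) % num_buckets, and num_buckets is a
    multiple of num_coalesced_buckets, so
    (hash % num_buckets) % num_coalesced_buckets == hash % num_coalesced_buckets.
    """
    groups = defaultdict(list)
    for bucket_id in bucket_ids:
        groups[bucket_id % num_coalesced_buckets].append(bucket_id)
    return dict(groups)


# An 8-bucket table joined with a 4-bucket table: the 8 buckets are read
# as 4 groups, so the scan can run with at most 4 tasks instead of 8.
groups = coalesce_buckets(range(8), 4)
for coalesced_id, members in sorted(groups.items()):
    print(coalesced_id, members)
```

   Each coalesced bucket is read by a single task, so the shuffle is avoided at the cost of fewer, larger tasks on the side that had more buckets.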
   
   > (2) Per this PR, the ordering property of coalesced bucket files is not 
preserved 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L317),
 but the ordering could be preserved through a sort-merge-style read of all the 
sorted bucket files. This can also help when reading multiple partitions of a 
bucketed table.
   
   I think that's a long-standing issue we have. Have you checked the 
discussion in SPARK-24528? If you're interested in the issue, you can revisit 
it there.
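
   As a rough illustration of the point in (2), the sketch below uses plain Python lists to stand in for bucket files that are each sorted on the join key. It is not how Spark reads files; `merge_sorted_buckets` is a hypothetical helper, and the sample values are made up for the example. Reading the files back to back loses the global order, while a k-way merge keeps it.

```python
import heapq


def merge_sorted_buckets(buckets, key=None):
    """Lazily merge several individually sorted sequences into one sorted stream."""
    return heapq.merge(*buckets, key=key)


# Two bucket files that would land in the same coalesced bucket,
# each already sorted on the join key (illustrative values).
bucket_0 = [1, 5, 9]
bucket_4 = [3, 5, 13]

# Reading the files one after another does not yield a sorted stream.
concatenated = bucket_0 + bucket_4

# A k-way merge preserves the ordering without a full re-sort.
merged = list(merge_sorted_buckets([bucket_0, bucket_4]))
print(concatenated)
print(merged)
```

   The merge only needs to hold one pending row per input file, which is why it is cheaper than sorting the concatenated output again.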
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable

2020-06-19 Thread GitBox


maropu commented on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-646894724


   Thanks! I appreciate your hard work, @imback82! Merged to master. Also, 
thanks for the reviews, @cloud-fan @viirya!






[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable

2020-06-12 Thread GitBox


maropu commented on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-643500443


   retest this please






[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable

2020-06-12 Thread GitBox


maropu commented on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-643251602


   retest this please






[GitHub] [spark] maropu commented on pull request #28123: [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable

2020-06-12 Thread GitBox


maropu commented on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-643107248


   retest this please


