[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: URL: https://github.com/apache/spark/pull/27096#issuecomment-618200415 Fixed latest conflicts/rebased on latest master @cloud-fan could you take a look when you get a chance? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-583955409 @cloud-fan @viirya bump This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-579068832 mind taking another look? @cloud-fan @viirya This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-577482952 @cloud-fan and/or others please review when convenient This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-576978084 any remaining comments? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-576830723 rebased This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-575968635 do the tests run post merge with master, or do I need to rebase to pick up fix #27242? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-575965874 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-575001493 updated with a new sparkplan rule and added a test which makes sure a user's repartition with a different numPartitions would not be eliminated (don't want to change expected numPartitions). Review when you get a chance :). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-574475407 I started to try including #26946, but it feels a little messy. Notice that the numPartitions of the previous shuffle are maintained, but the distribution/partitioning is changed: `case (ShuffleExchangeExec(partitioning, child, _), distribution: OrderedDistribution) => ShuffleExchangeExec(distribution.createPartitioning(partitioning.numPartitions), child)` If we try to remove this code from EnsureRequirements, it'll create a new shuffle node with `defaultNumPreShufflePartitions`. Does it make sense to have a general rule that if we have a shuffle with range partitioning with another shuffle as child, we eliminate the child but reuse it's numpartitions? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-574005977 i'll update soon with the new rule `PruneShuffleAndSort` I'm iffy on including [PR-26946](https://github.com/apache/spark/pull/26946). `EnsureRequirements` should probably only introduce shuffles/ordering if it needs to, rather than adding blindly and relying on later rule to prune. Or maybe it is better to include so it'll apply to any case (haven't thought deeply to the cases here)? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-573523635 Any more suggestions? I'm starting to think the new physical node is a bit of a hack as well just to reuse the code in `ensureDistributionAndOrdering`. We wouldn't want this "no op" SparkPlan to show up in the plan right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away URL: https://github.com/apache/spark/pull/27096#issuecomment-572408308 Thanks for taking a look! Yes, the reason it is here is because the shuffle/sorting is introduced by EnsureRequirements itself, which causes the user added sorts/shuffles unnecessary. Yea it felt a little hacky for optimization code to be in a rule called EnsureRequirements. I'd like someone more familiar with overall planner design to suggest whether I go through with 1st or 2nd option. For 2nd option, won't I need to create a new physical node for both the repartition and sort, each of which is kinda a dummy physical node which relies on EnsureRequirements to add the necessary sorts/partitioning based on `requiredChildDistribution` and `requiredChildOrdering` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org