[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-04-23 Thread GitBox


bmarcott commented on issue #27096:
URL: https://github.com/apache/spark/pull/27096#issuecomment-618200415


   Fixed latest conflicts/rebased on latest master
   
   @cloud-fan  could you take a look when you get a chance?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-02-09 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-583955409
 
 
   @cloud-fan @viirya 
   bump  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-27 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-579068832
 
 
   mind taking another look?
   @cloud-fan 
   @viirya 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-22 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-577482952
 
 
   @cloud-fan and/or others please review when convenient


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-21 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-576978084
 
 
   any remaining comments?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-21 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-576830723
 
 
   rebased


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-18 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-575968635
 
 
   do the tests run post merge with master, or do I need to rebase to pick up 
fix  #27242?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-18 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-575965874
 
 
   retest this please


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-15 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-575001493
 
 
   updated with a new sparkplan rule and added a test which makes sure a user's 
repartition with a different numPartitions would not be eliminated (don't want 
to change expected numPartitions).
   
   Review when you get a chance :).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-14 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-574475407
 
 
   I started to try including #26946, but it feels a little messy.
   
   Notice that the numPartitions of the previous shuffle are maintained, but 
the distribution/partitioning is changed:
   `case (ShuffleExchangeExec(partitioning, child, _), distribution: 
OrderedDistribution) =>
   
ShuffleExchangeExec(distribution.createPartitioning(partitioning.numPartitions),
 child)`
   
   If we try to remove this code from EnsureRequirements, it'll create a new 
shuffle node with `defaultNumPreShufflePartitions`. Does it make sense to have 
a general rule that if we have a shuffle with range partitioning with another 
shuffle as child, we eliminate the child but reuse it's numpartitions?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-13 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-574005977
 
 
   i'll update soon with the new rule `PruneShuffleAndSort`
   I'm iffy on including [PR-26946](https://github.com/apache/spark/pull/26946).
   `EnsureRequirements` should probably only introduce shuffles/ordering if it 
needs to, rather than adding blindly and relying on later rule to prune. Or 
maybe it is better to include so it'll apply to any case (haven't thought 
deeply to the cases here)?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-12 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-573523635
 
 
   Any more suggestions?
   I'm starting to think the new physical node is a bit of a hack as well just 
to reuse the code in `ensureDistributionAndOrdering`. We wouldn't want this "no 
op" SparkPlan to show up in the plan right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join is not optimized away

2020-01-08 Thread GitBox
bmarcott commented on issue #27096: [SPARK-28148][SQL] Repartition after join 
is not optimized away
URL: https://github.com/apache/spark/pull/27096#issuecomment-572408308
 
 
   Thanks for taking a look!
   Yes, the reason it is here is because the shuffle/sorting is introduced by 
EnsureRequirements itself, which causes the user added sorts/shuffles 
unnecessary. Yea it felt a little hacky for optimization code to be in a rule 
called EnsureRequirements.
   
   I'd like someone more familiar with overall planner design to suggest 
whether I go through with 1st or 2nd option.
   For 2nd option, won't I need to create a new physical node for both the 
repartition and sort, each of which is kinda a dummy physical node which relies 
on EnsureRequirements to add the necessary sorts/partitioning based on 
`requiredChildDistribution` and `requiredChildOrdering`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org