Thanks, Jack! Please let me know if you find any other guides specific to tuning shuffles and joins.
Currently, the best way I know to handle joins across large datasets that can't be broadcast is to rewrite the source tables Hive-partitioned by one or two join keys, then break the joins into stages with intermediate write steps: each stage joins a handful of tables and writes the result with a new Hive partition scheme that suits the next set of joins. I run these joins in a Python loop over the Hive partitions to minimize load. I imagine there's more I could do to reduce the amount of manual coding and intermediate write steps.

I'll start with these docs! Thanks,
Bryant

On Tue, Nov 28, 2023, 5:23 PM Jack Goodson <jackagood...@gmail.com> wrote:

> Hi Bryant,
>
> the below docs are a good start on performance tuning
>
> https://spark.apache.org/docs/latest/sql-performance-tuning.html
>
> Hope it helps!
>
> On Wed, Nov 29, 2023 at 9:32 AM Bryant Wright <wrightcod...@gmail.com> wrote:
>
>> Hi, I'm looking for a comprehensive list of Tuning Best Practices for
>> Spark.
>>
>> I did a search on the archives for "tuning" and the search returned no
>> results.
>>
>> Thanks for your help.
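
For what it's worth, the staged-join loop I described can be sketched roughly like this. Table names (`orders`, `customers`), the join key (`region`), and the helper name are made up for illustration; in a real job each generated statement would be submitted via `spark.sql(stmt)`, but here the loop just builds the SQL strings so the pattern is visible:

```python
def staged_join_sql(left, right, key, partition_values, out_table):
    """Generate one INSERT ... SELECT per Hive partition value.

    Looping over partition values keeps only one partition's worth of
    data in each shuffle, instead of joining the full tables at once.
    """
    stmts = []
    for v in partition_values:
        stmts.append(
            f"INSERT OVERWRITE TABLE {out_table} PARTITION ({key}='{v}') "
            f"SELECT l.* FROM {left} l "
            f"JOIN {right} r ON l.{key} = r.{key} "
            f"WHERE l.{key} = '{v}' AND r.{key} = '{v}'"
        )
    return stmts

# Hypothetical usage: one statement per Hive partition of the join key.
stmts = staged_join_sql("orders", "customers", "region",
                        ["us", "eu"], "orders_enriched")
# In the real pipeline: for stmt in stmts: spark.sql(stmt)
```

The output table's partition scheme can then be chosen to suit whatever key the next stage joins on, which is the "intermediate write step" part of the approach.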