Thanks, Jack!

Please let me know if you find any other guides specific to tuning shuffles
and joins.

Currently, the best way I know to handle joins across large datasets that
can't be broadcast is to rewrite the source tables Hive-partitioned by one
or two join keys, then break the joins into stages with intermediate write
steps: each stage joins a handful of tables and writes the result with a
new Hive partition scheme that suits the next set of joins. I perform these
joins in a Python loop over the Hive partitions to minimize memory load.
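For what it's worth, the per-partition loop I describe boils down to
something like this Spark-free Python sketch (the bucket count, table
contents, and helper names are all made up for illustration; the point is
just that rows with the same join key land in the same partition, so each
partition can be joined independently):

```python
# Sketch of a staged, partition-wise join: bucket both tables by the
# join key, then join bucket-by-bucket so only one partition's rows
# need to be in memory at a time.

NUM_BUCKETS = 4  # illustrative; in practice this is the Hive partition count

def bucket_by_key(rows, key, num_buckets=NUM_BUCKETS):
    """Partition rows into buckets by hash of the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets

def join_bucket(left_rows, right_rows, key):
    """Plain hash join within a single bucket."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]

def partitioned_join(left, right, key):
    """Join two tables bucket by bucket; matching keys always hash to
    the same bucket, so the per-bucket joins compose to the full join."""
    left_buckets = bucket_by_key(left, key)
    right_buckets = bucket_by_key(right, key)
    out = []
    for lb, rb in zip(left_buckets, right_buckets):  # the "python loop over partitions"
        out.extend(join_bucket(lb, rb, key))
    return out

# Toy data: 6 orders joined to 3 customers on "cust".
orders = [{"cust": i % 3, "order_id": i} for i in range(6)]
custs = [{"cust": i, "name": f"c{i}"} for i in range(3)]
joined = partitioned_join(orders, custs, "cust")
```

In Spark terms, each iteration of the loop reads and joins one Hive
partition from each table instead of shuffling the whole datasets at once.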

I imagine there's more I could do to reduce the amount of manual coding and
intermediate write steps.

I'll start with these docs!

Thanks,

Bryant

On Tue, Nov 28, 2023, 5:23 PM Jack Goodson <jackagood...@gmail.com> wrote:

> Hi Bryant,
>
> the below docs are a good start on performance tuning
>
> https://spark.apache.org/docs/latest/sql-performance-tuning.html
>
> Hope it helps!
>
> On Wed, Nov 29, 2023 at 9:32 AM Bryant Wright <wrightcod...@gmail.com>
> wrote:
>
>> Hi, I'm looking for a comprehensive list of Tuning Best Practices for
>> spark.
>>
>> I did a search on the archives for "tuning" and the search returned no
>> results.
>>
>> Thanks for your help.
>>
>