Have you tried

partitioning df1 and df2 on key1,
joining them,
partitioning df3 and the result above on key2,
and joining again?
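
A rough sketch of that flow, assuming the df1/df2/df3 names and key columns
from your example (the partition count and intermediate names are just
placeholders):

import org.apache.spark.sql.functions.col

// Stage 1: co-partition df1 and df2 on key1, then join
val df1ByKey1 = df1.repartition(200, col("key1"))
val df2ByKey1 = df2.repartition(200, col("key1"))
val join1 = df1ByKey1.join(df2ByKey1, "key1")

// Stage 2: co-partition the intermediate result and df3 on key2, then join again
val join1ByKey2 = join1.repartition(200, col("key2"))
val df3ByKey2 = df3.repartition(200, col("key2"))
val join2 = join1ByKey2.join(df3ByKey2, "key2")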

That’s the strategy I use, and it scales well for me. For reference, check
getBlocks in

https://github.com/zinggAI/zingg/blob/main/core/src/main/java/zingg/Matcher.java


On Tue, 5 Oct 2021 at 3:05 PM, Saurabh Gulati
<saurabh.gul...@fedex.com.invalid> wrote:

> Hi Amit,
> The only approach I can think of is to create 2 copies of schema_df1​,
> one partitioned on key1 and other on key2 and then use these to Join.
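>
> Roughly something like this (the df1ByKey1/df1ByKey2 names are placeholders,
> and the caching is optional):
>
> import org.apache.spark.sql.functions.col
>
> // Two copies of df1, each pre-partitioned on the key it will be joined on
> val df1ByKey1 = df1.repartition(col("key1")).cache()
> val df1ByKey2 = df1.repartition(col("key2")).cache()
>
> val join1 = df1ByKey1.join(df2.repartition(col("key1")), "key1")
> val join2 = df1ByKey2.join(df3.repartition(col("key2")), "key2")
>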
> ------------------------------
> *From:* Amit Joshi <mailtojoshia...@gmail.com>
> *Sent:* 04 October 2021 19:13
> *To:* spark-user <user@spark.apache.org>
> *Subject:* [EXTERNAL] [Marketing Mail] Re: [Spark] Optimize spark join on
> different keys for same data frame
>
> Hi spark users,
>
> Can anyone please provide their views on this topic?
>
>
> Regards
> Amit Joshi
>
> On Sunday, October 3, 2021, Amit Joshi <mailtojoshia...@gmail.com> wrote:
>
> Hi Spark-Users,
>
> Hope you are doing good.
>
> I have been working on cases where a dataframe is joined separately with
> more than one other dataframe, on different columns, and these joins happen
> frequently.
> I was wondering how to optimize the joins to make them faster.
> The datasets are big, so broadcast joins are not an option.
>
> For example:
>
> import org.apache.spark.sql.types._
>
> val schema_df1 = new StructType()
>   .add(StructField("key1", StringType, true))
>   .add(StructField("key2", StringType, true))
>   .add(StructField("val", DoubleType, true))
>
> val schema_df2 = new StructType()
>   .add(StructField("key1", StringType, true))
>   .add(StructField("val", DoubleType, true))
>
> val schema_df3 = new StructType()
>   .add(StructField("key2", StringType, true))
>   .add(StructField("val", DoubleType, true))
>
> Now we want to join:
>
> val join1 = df1.join(df2, "key1")
> val join2 = df1.join(df3, "key2")
>
> I was thinking of bucketing as a solution to speed up the joins. But if I
> bucket df1 on key1, then join2 may not benefit, and vice versa (if I bucket
> df1 on key2).
>
> Or should we bucket df1 twice, once on key1 and once on key2?
> Is there a strategy to make both joins faster?
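>
> To illustrate the "bucket twice" idea, something like the following (table
> names and the bucket count are just placeholders; df2 and df3 would ideally
> be bucketed on the same keys with the same number of buckets so the joins
> can avoid a shuffle):
>
> df1.write.bucketBy(64, "key1").sortBy("key1").saveAsTable("df1_bucketed_key1")
> df1.write.bucketBy(64, "key2").sortBy("key2").saveAsTable("df1_bucketed_key2")
>
> val join1 = spark.table("df1_bucketed_key1").join(df2, "key1")
> val join2 = spark.table("df1_bucketed_key2").join(df3, "key2")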
>
>
> Regards
> Amit Joshi
>
>
>
--
Cheers,
Sonal
https://github.com/zinggAI/zingg
