Have you tried partitioning df1 and df2 on key1 and joining them, then partitioning df3 and the result of that first join on key2 and joining them again?
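A rough sketch of that two-stage approach (assuming df1, df2 and df3 are already loaded with the schemas quoted below; the partition count of 200 is just a placeholder):

import org.apache.spark.sql.functions.col

// Stage 1: co-partition df1 and df2 on key1 so the join can reuse that partitioning
val df1ByKey1 = df1.repartition(200, col("key1"))
val df2ByKey1 = df2.repartition(200, col("key1"))
val join1 = df1ByKey1.join(df2ByKey1, "key1")

// Stage 2: join1 still carries key2, so co-partition it with df3 on key2 and join again
val join1ByKey2 = join1.repartition(200, col("key2"))
val df3ByKey2 = df3.repartition(200, col("key2"))
val join2 = join1ByKey2.join(df3ByKey2, "key2")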
That’s the strategy I use and it scales well for me. For reference, check getBlocks in
https://github.com/zinggAI/zingg/blob/main/core/src/main/java/zingg/Matcher.java

On Tue, 5 Oct 2021 at 3:05 PM, Saurabh Gulati <saurabh.gul...@fedex.com.invalid> wrote:

> Hi Amit,
> The only approach I can think of is to create 2 copies of schema_df1,
> one partitioned on key1 and the other on key2, and then use these to join.
> ------------------------------
> *From:* Amit Joshi <mailtojoshia...@gmail.com>
> *Sent:* 04 October 2021 19:13
> *To:* spark-user <user@spark.apache.org>
> *Subject:* [EXTERNAL] [Marketing Mail] Re: [Spark] Optimize spark join on
> different keys for same data frame
>
> Hi spark users,
>
> Can anyone please provide any views on the topic?
>
> Regards
> Amit Joshi
>
> On Sunday, October 3, 2021, Amit Joshi <mailtojoshia...@gmail.com> wrote:
>
> Hi Spark-Users,
>
> Hope you are doing well.
>
> I have been working on cases where a dataframe is joined with more than
> one data frame separately, on different columns, and that too frequently.
> I was wondering how to optimize the joins to make them faster.
> We can consider the datasets to be big in size, so broadcast joins are not
> an option.
>
> For eg:
>
> schema_df1 = new StructType()
>   .add(StructField("key1", StringType, true))
>   .add(StructField("key2", StringType, true))
>   .add(StructField("val", DoubleType, true))
>
> schema_df2 = new StructType()
>   .add(StructField("key1", StringType, true))
>   .add(StructField("val", DoubleType, true))
>
> schema_df3 = new StructType()
>   .add(StructField("key2", StringType, true))
>   .add(StructField("val", DoubleType, true))
>
> Now if we want to join:
>
> join1 = df1.join(df2, "key1")
> join2 = df1.join(df3, "key2")
>
> I was thinking of bucketing as a solution to speed up the joins. But if I
> bucket df1 on key1, then join2 may not benefit, and vice versa (if I
> bucket df1 on key2).
>
> Or should we bucket df1 twice, once with key1 and once with key2?
> Is there a strategy to make both joins faster?
>
> Regards
> Amit Joshi

--
Cheers,
Sonal
https://github.com/zinggAI/zingg
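A minimal sketch of the "bucket df1 twice" alternative raised in the question above (the table names and the bucket count of 32 are made-up placeholders; note this costs an extra persisted copy of df1):

// Write two bucketed copies of df1, one per join key, plus bucketed copies of df2 and df3,
// so each join can read pre-bucketed tables instead of shuffling them at join time.
df1.write.bucketBy(32, "key1").sortBy("key1").saveAsTable("df1_by_key1")
df1.write.bucketBy(32, "key2").sortBy("key2").saveAsTable("df1_by_key2")
df2.write.bucketBy(32, "key1").sortBy("key1").saveAsTable("df2_by_key1")
df3.write.bucketBy(32, "key2").sortBy("key2").saveAsTable("df3_by_key2")

val join1 = spark.table("df1_by_key1").join(spark.table("df2_by_key1"), "key1")
val join2 = spark.table("df1_by_key2").join(spark.table("df3_by_key2"), "key2")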