Hi Amit,
The only approach I can think of is to create two copies of df1, one partitioned on key1 and the other on key2, and then use those copies in the respective joins.
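
For example, a rough sketch of how that could look with bucketed tables (the table names and bucket count below are just placeholders):

// Write two bucketed copies of df1, one per join key.
df1.write.bucketBy(200, "key1").sortBy("key1").saveAsTable("df1_bucketed_by_key1")
df1.write.bucketBy(200, "key2").sortBy("key2").saveAsTable("df1_bucketed_by_key2")

// Each join then reads the copy that is already laid out on its own key,
// so df1 does not have to be shuffled again for every join.
val join1 = spark.table("df1_bucketed_by_key1").join(df2, "key1")
val join2 = spark.table("df1_bucketed_by_key2").join(df3, "key2")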
________________________________
From: Amit Joshi <mailtojoshia...@gmail.com>
Sent: 04 October 2021 19:13
To: spark-user <user@spark.apache.org>
Subject: [EXTERNAL] [Marketing Mail] Re: [Spark] Optimize spark join on 
different keys for same data frame

Hi spark users,

Can anyone please provide their views on this topic?


Regards
Amit Joshi

On Sunday, October 3, 2021, Amit Joshi <mailtojoshia...@gmail.com> wrote:
Hi Spark-Users,

Hope you are doing good.

I have been working on cases where one dataframe is joined with more than one other dataframe separately, each on a different column, and these joins are run frequently.
I was wondering how to optimize the joins to make them faster.
The datasets are large, so broadcast joins are not an option.

For example:

import org.apache.spark.sql.types._

val schema_df1 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df2 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df3 = new StructType()
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

Now we want to perform the joins:
val join1 = df1.join(df2, "key1")
val join2 = df1.join(df3, "key2")

I was thinking of bucketing as a solution to speed up the joins. But if I bucket df1 on key1, then join2 may not benefit, and vice versa (if I bucket df1 on key2).
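
For illustration, bucketing df1 on key1 only might look like this (a sketch; the table name and bucket count are placeholders):

// Bucket df1 on key1 only.
df1.write.bucketBy(200, "key1").sortBy("key1").saveAsTable("df1_bucketed_key1")

// join1 can avoid re-shuffling df1 (especially if df2 is bucketed on key1 as well) ...
val join1 = spark.table("df1_bucketed_key1").join(df2, "key1")
// ... but join2 still has to shuffle df1 on key2, so it gains nothing from this layout.
val join2 = spark.table("df1_bucketed_key1").join(df3, "key2")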

Or should we bucket df1 twice, once on key1 and once on key2?
Is there a strategy that makes both joins faster?


Regards
Amit Joshi


