Hi,
We have two Hive tables which are read in Spark and joined on a join key,
let's call it user_id.
We then write this joined dataset to S3 and register it in Hive as a third table
so that subsequent tasks can use the joined dataset.
One of the other columns in the joined dataset is called keychain_id.
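
For concreteness, here is roughly how the join and write look today in PySpark; the table names and the S3 path below are placeholders, not our actual ones:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Read the two Hive tables and join them on user_id.
    # "db.table_a" and "db.table_b" are placeholder names.
    df_a = spark.table("db.table_a")
    df_b = spark.table("db.table_b")
    joined = df_a.join(df_b, on="user_id")

    # Write the joined dataset to S3 and register it as a third Hive table
    # (placeholder path and table name).
    (joined.write
        .mode("overwrite")
        .option("path", "s3://bucket/prefix/joined_table")
        .saveAsTable("db.joined_table"))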

We want to group all the user records belonging to the same keychain_id into the
same partition, in order to avoid shuffles later.
So, can I do a repartition("keychain_id") before writing to S3 and registering
it in Hive, and when I read the data back from this third table, will it
still have the same partition grouping (all users belonging to the
same keychain_id in the same partition)? We are trying to avoid doing a
repartition("keychain_id") every time we read from this third table.
Can you please clarify? If there is no guarantee that the same partition
grouping is retained when reading, is there another efficient way to do this
other than caching?
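
For reference, a rough sketch of the write we have in mind, again with placeholder names; the intent is that all rows for a given keychain_id land in one partition at write time, and ideally stay grouped when the table is read back:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # "joined" is the joined dataset from the step above (placeholder names).
    joined = spark.table("db.table_a").join(spark.table("db.table_b"), on="user_id")

    # Repartition by keychain_id before writing, so all records with the same
    # keychain_id end up in the same partition of the written dataset.
    (joined.repartition("keychain_id")
        .write
        .mode("overwrite")
        .option("path", "s3://bucket/prefix/joined_table")
        .saveAsTable("db.joined_table"))

    # Later tasks read the third table back; the question is whether this read
    # still has the per-keychain_id grouping, or whether we need to call
    # repartition("keychain_id") again here.
    reread = spark.table("db.joined_table")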
Regards,
Aravind
