Curious – why do you want to repartition? Is there a subsequent step that 
fails because the number of partitions is too low? Or do you want to do it for 
a performance gain?

Also, how many partitions did your initial Datasets have, and how many did the 
result of the join have?

From: Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
Sent: Friday, November 11, 2016 9:22 AM
To: user <user@spark.apache.org>
Subject: Dataset API | Setting number of partitions during join/groupBy

Hi

I can't seem to find a way to pass the number of partitions while joining 2 
Datasets or doing a groupBy operation on a Dataset. There is an option of 
repartitioning the resultant Dataset, but it's inefficient to repartition after 
the Dataset has been joined/grouped into the default number of partitions. With 
the RDD API, this was easy to do as the functions accepted a numPartitions 
parameter. The only way to do this seems to be 
sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>), but 
this means that all join/groupBy operations going forward will have the same 
number of partitions.
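
For concreteness, here is a minimal sketch of the two workarounds I'm aware of 
(assuming Spark 2.x in a local session; SQLConf.SHUFFLE_PARTITIONS.key resolves 
to the literal "spark.sql.shuffle.partitions", and the object name and toy data 
are made up for illustration):

import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Disable broadcast joins so the join actually shuffles; otherwise this
    // tiny data would be broadcast and the setting would not be observable.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "v1")
    val right = Seq((1, "x"), (2, "y")).toDF("id", "v2")

    // Option 1: set the session-wide shuffle partition count before the join.
    // This affects every subsequent shuffle in the session, which is the problem.
    val previous = spark.conf.get("spark.sql.shuffle.partitions")
    spark.conf.set("spark.sql.shuffle.partitions", "16")
    val joined = left.join(right, "id")
    println(joined.rdd.getNumPartitions)  // 16, one per shuffle partition
    spark.conf.set("spark.sql.shuffle.partitions", previous)  // restore afterwards

    // Option 2: repartition the result, at the cost of an extra shuffle.
    val repartitioned = joined.repartition(8)
    println(repartitioned.rdd.getNumPartitions)  // 8

    spark.stop()
  }
}

Saving and restoring the previous value works, but it's clunky and racy if 
other jobs share the session, which is why a per-operation numPartitions 
parameter (as in the RDD API) would be much nicer.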

Thanks,
Aniket
