Curious – why do you want to repartition? Is there a subsequent step that fails because the number of partitions is too small, or are you after a performance gain?
Also, what were your initial Dataset partitions, and how many did you have for the result of the join?

From: Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
Sent: Friday, November 11, 2016 9:22 AM
To: user <user@spark.apache.org>
Subject: Dataset API | Setting number of partitions during join/groupBy

Hi

I can't seem to find a way to pass the number of partitions while joining 2 Datasets or doing a groupBy operation on a Dataset. There is an option of repartitioning the resultant Dataset, but it's inefficient to repartition after the Dataset has already been joined/grouped into the default number of partitions. With the RDD API this was easy to do, as the functions accepted a numPartitions parameter. The only way to do this seems to be sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>), but this means that all join/groupBy operations going forward will have the same number of partitions.

Thanks,
Aniket
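For what it's worth, one common workaround for the limitation described above is to scope the shuffle-partitions setting around the specific join and restore it afterwards, so later operations are unaffected. A minimal sketch, assuming a local SparkSession and two small example Datasets with a shared "id" column (all names here are hypothetical, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-sketch")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Hypothetical example Datasets sharing an "id" column.
    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x"), (2, "y")).toDF("id", "r")

    // Remember the current setting so it can be restored afterwards.
    val previous = spark.conf.get("spark.sql.shuffle.partitions")

    // Temporarily set the number of shuffle partitions for this join only.
    spark.conf.set("spark.sql.shuffle.partitions", "8")
    val joined = left.join(right, "id")

    // Datasets are lazy: the setting is read when the job actually runs,
    // so force an action before restoring the old value.
    joined.count()

    // Restore the previous value so subsequent joins/groupBys use it.
    spark.conf.set("spark.sql.shuffle.partitions", previous)

    spark.stop()
  }
}
```

The caveat in the comments matters: because execution is lazy, restoring the config before triggering an action on `joined` would make the join run with the restored value, not the temporary one.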