Hi,

I can't seem to find a way to pass the number of partitions when joining two Datasets or doing a groupBy operation on a Dataset. There is the option of repartitioning the resulting Dataset, but it's inefficient to repartition after the join/groupBy has already shuffled into the default number of partitions. With the RDD API this was easy, as those functions accepted a numPartitions parameter. The only way I see to do this with Datasets is sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>), but that means all join/groupBy operations going forward will use the same number of partitions.
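For concreteness, here is a minimal sketch of the two APIs side by side (the dataset names, join key, and partition count are illustrative, not from any real job):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("partitions-example").getOrCreate()
import spark.implicits._

val dsA = Seq((1, "a"), (2, "b")).toDS()
val dsB = Seq((1, "x"), (2, "y")).toDS()

// RDD API: the number of shuffle partitions can be passed per operation.
val joinedRdd = dsA.rdd.keyBy(_._1)
  .join(dsB.rdd.keyBy(_._1), 8)   // 8 partitions for this join only

// Dataset API: no per-operation parameter. The only knob is the global
// setting (SQLConf.SHUFFLE_PARTITIONS.key resolves to the string below),
// which affects every subsequent shuffle in the session.
spark.conf.set("spark.sql.shuffle.partitions", "8")
val joinedDs = dsA.joinWith(dsB, dsA("_1") === dsB("_1"))
```

The point of the sketch is the asymmetry: with RDDs the partition count is scoped to a single join, while the SQL setting is session-wide.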
Thanks, Aniket