Hello there,

I may be having a naive question on join / groupBy-agg. During the days of
RDD, whenever I wanted to perform
a. groupBy-agg, I used to say reduceByKey (of PairRDDFunctions) with an
optional Partition-Strategy (with is number of partitions or Partitioner)
b. join (of PairRDDFunctions) and its variants, I used to have a way to
provide number of partitions

In DataFrame, how do I specify the number of partitions during this
operation? I could use repartition() after the fact. But this would be
another Stage in the Job.

One work around to increase the number of partitions / task during a join
is to set 'spark.sql.shuffle.partitions' it some desired number during
spark-submit. I am trying to see if there is a way to provide this
programmatically for every step of a groupBy-agg / join.

Please advice,
Muthu

Reply via email to