Hey all,

I know I can control the number of partitions used during Dataset
aggregations (groupBy, groupByKey, distinct, ...) via the
spark.sql.shuffle.partitions configuration.
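
To illustrate what I mean (the app name, input path, and column name below
are just made up), this session-wide setting is the only knob I know of:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitions-demo").getOrCreate()

    // One global setting for all Dataset aggregations in this session.
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    // Every aggregation from here on shuffles into 400 partitions,
    // no matter how cheap or expensive the individual aggregation is.
    val events = spark.read.parquet("/path/to/events") // hypothetical input
    val counts = events.groupBy("userId").count()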

Is there any specific reason why the Dataset API does not support passing
the number of partitions explicitly to each relevant call, the way the RDD
API does? That would be much more flexible: if an app contains several
aggregations (quite common) that differ in how computationally expensive
they are (also quite common), we are forced to set a larger number of
partitions for the trivial aggregations as well.
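
To make the asymmetry concrete, here is a sketch; the input paths and the
key column are hypothetical, and I am not certain the workaround at the end
always avoids a second shuffle:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("per-call-partitions").getOrCreate()
    val sc = spark.sparkContext

    // RDD API: the partition count is a per-call argument.
    val bigRdd = sc.textFile("/path/to/big").map(w => (w, 1L))     // hypothetical input
    val smallRdd = sc.textFile("/path/to/small").map(w => (w, 1L)) // hypothetical input
    val heavy = bigRdd.reduceByKey(_ + _, 2000) // expensive aggregation: many partitions
    val light = smallRdd.reduceByKey(_ + _, 20) // trivial aggregation: few partitions

    // Dataset API: groupBy/groupByKey/distinct take no such argument. The
    // closest workaround I know of is an explicit repartition beforehand:
    val bigDs = spark.read.parquet("/path/to/big-ds") // hypothetical input
    val heavyAgg = bigDs.repartition(2000, col("key")).groupBy("key").count()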

Any thoughts or pointers to relevant design documents appreciated...

Thanks!

Jakub Dubovsky
