Hey all, I know I can control the number of partitions used during Dataset aggregations (groupBy, groupByKey, distinct, ...) via the spark.sql.shuffle.partitions configuration.
Is there any specific reason why the Dataset API does not support passing the number of partitions explicitly to each relevant call, the way the RDD API does? That would be much more flexible. If an app has several aggregations (quite common) and they differ in how computationally expensive they are (also quite common), then we are forced to set a larger number of partitions for the trivial aggregations as well. Any thoughts or pointers to relevant design documents appreciated... Thanks! Jakub Dubovsky
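To make the contrast concrete, here is a rough sketch of what I mean (names and numbers are just for illustration; how much of the repartitioning the planner actually reuses depends on your Spark version):

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = (0L until 100000L).toDS().map(i => (i % 100, i))

    // RDD API: the shuffle width can be chosen per operation.
    val rddCounts = ds.rdd.reduceByKey(_ + _, 8) // 8 partitions for this shuffle only

    // Dataset API: every aggregation uses spark.sql.shuffle.partitions.
    // Workaround 1: mutate the conf between jobs (affects all later shuffles).
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    val heavy = ds.groupByKey(_._1).count()

    // Workaround 2: repartition by the grouping key first; recent planners
    // can recognize the partitioning and avoid a second exchange.
    val light = ds.repartition(8, $"_1").groupBy($"_1").count()

    spark.stop()
  }
}
```

Both workarounds feel clumsy compared to just passing a numPartitions argument, which is what I was asking about.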