The default number of tasks when reading files is based on how the files
are split across the nodes. After a shuffle, the default number of tasks is
controlled by the property spark.default.parallelism (see
http://spark.apache.org/docs/latest/configuration.html).
You can use RDD.repartition to change the number of partitions, and
therefore the number of tasks in a stage.
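
For example, something along these lines (a minimal sketch only -- the app
name, input path, and the value 64 are illustrative assumptions, not taken
from your job):

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionExample {
      def main(args: Array[String]): Unit = {
        // Raise the default shuffle parallelism for the whole application.
        val conf = new SparkConf()
          .setAppName("repartition-example")
          .set("spark.default.parallelism", "64")
        val sc = new SparkContext(conf)

        // The number of read tasks follows the input splits...
        val lines = sc.textFile("hdfs:///data/input")
        println(s"partitions after read: ${lines.partitions.length}")

        // ...but you can repartition explicitly to spread the work
        // across more tasks (and hence more nodes).
        val repartitioned = lines.repartition(64)
        println(s"partitions after repartition: ${repartitioned.partitions.length}")

        sc.stop()
      }
    }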
Hi all,
is there any way to control the number of tasks per stage?
Currently I see a situation where only 2 tasks are created per stage and each
of them is very slow, while at the same time the cluster has a huge number of
unused nodes.
Thank you,
Konstantin Kudryavtsev