Re: Control number of tasks per stage

2014-07-07 Thread Daniel Siegmann
The default number of tasks when reading files is based on how the files are split among the nodes. Beyond that, the default number of tasks after a shuffle is controlled by the property spark.default.parallelism (see http://spark.apache.org/docs/latest/configuration.html). You can use RDD.repartition to change the number of partitions.
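
A minimal Scala sketch of the two knobs mentioned above, assuming a Spark 1.x-style RDD job; the app name, file path, and the partition count of 64 are placeholders for illustration, not values from this thread:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("RepartitionExample")  // placeholder app name
    // Default number of partitions (and thus tasks) for shuffle stages
    // such as reduceByKey or join, when no count is given explicitly.
    .set("spark.default.parallelism", "64")

  val sc = new SparkContext(conf)

  // The number of tasks for this read stage is driven by the input splits.
  val lines = sc.textFile("hdfs:///path/to/input")  // placeholder path

  // Explicitly reshuffle into more partitions so downstream stages
  // run with more tasks and can use more of the cluster.
  val repartitioned = lines.repartition(64)

  println(repartitioned.partitions.length)  // should print 64

Note that repartition performs a full shuffle; if only the downstream parallelism matters, passing a numPartitions argument to the shuffle operation itself (e.g. reduceByKey(_ + _, 64)) avoids the extra pass.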

Control number of tasks per stage

2014-07-07 Thread Konstantin Kudryavtsev
Hi all, is there any way to control the number of tasks per stage? Currently I see a situation where only 2 tasks are created per stage and each of them is very slow, while at the same time the cluster has a huge number of unused nodes. Thank you, Konstantin Kudryavtsev