I am not too familiar with Spark Standalone, so unfortunately I cannot give
you any definite answer. I do want to clarify something though.

The properties spark.sql.shuffle.partitions and spark.default.parallelism
affect how your data is split up, which will determine the *total* number
of tasks, *NOT* the number of tasks that run in parallel. Of course, you
will never run more tasks in parallel than there are in total, so if your
data is small enough you might be able to control parallelism via these
parameters - but that wouldn't typically be how you'd use them.
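To make that concrete, here's a rough sketch (the session settings, names
and numbers are just illustrative, and I've disabled adaptive execution so
the partition count stays predictable):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder()
    .master("local[4]")  // only 4 tasks can ever run at once here
    .appName("partition-count-demo")  // hypothetical app name
    .config("spark.sql.shuffle.partitions", "200")  // total tasks per post-shuffle stage
    .config("spark.default.parallelism", "200")     // same idea for RDD shuffles (reduceByKey etc.)
    .config("spark.sql.adaptive.enabled", "false")  // keep the partition count fixed for the demo
    .getOrCreate()

  // A wide transformation shuffles the data into 200 partitions -- 200 tasks
  // in total for that stage, even though only 4 of them run at a time here.
  val counts = spark.range(1000000).groupBy(col("id") % 10).count()
  println(counts.rdd.getNumPartitions)  // 200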

On YARN, as you noted, there are spark.executor.instances and
spark.executor.cores, and you'd multiply them to get the maximum number of
tasks that can run in parallel on your cluster. But there is no guarantee
the executors will be distributed evenly across nodes.
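Back-of-the-envelope, that calculation looks something like this (the
example values are made up; in a real job you'd read them from whatever
conf you submitted with):

  val conf = spark.sparkContext.getConf
  val executors    = conf.getInt("spark.executor.instances", 2)  // e.g. 5 on your cluster
  val coresPerExec = conf.getInt("spark.executor.cores", 1)      // e.g. 4
  val cpusPerTask  = conf.getInt("spark.task.cpus", 1)           // usually 1
  // Upper bound on tasks running concurrently across the whole cluster:
  val maxParallelTasks = executors * coresPerExec / cpusPerTask  // 5 * 4 / 1 = 20
  // YARN decides where those executors land, so the per-node spread can be uneven.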

Unfortunately I'm not familiar with how this works on Spark Standalone.
Your expectations seem reasonable to me. Sorry I can't be more helpful;
hopefully someone else will be able to explain exactly how Standalone
handles this.
