While "apparently" saturating the N available workers using your proposed N
partitions - the "actual" distribution of workers to tasks is controlled by
the scheduler. If my past experience were of service - you can *not *trust
the default Fair Scheduler to ensure the round-robin scheduling of the
Dear community,
I have an RDD with N rows and N partitions. I want to ensure that all the
partitions run at the same time, by setting the number of vcores
(spark-yarn) to N. The partitions need to talk to each other over some
socket-based sync, which is why I need them to run more or less
simultaneously.