Hi, my iterative Spark program shows quite variable running times across iterations, although the computational load should be roughly the same. In each iteration the program adds a batch of tuples and deletes roughly the same number of tuples.
I suspect part of the reason is that the partitions are not distributed evenly across the machines. Is there an easy way to pin the location of each partition? (Say, each time I create a new RDD with 32 partitions while running on 4 machines, I would like to pin the first 8 partitions to the first machine, the second 8 partitions to the second machine, and so on.) I just want to verify whether my assumption is correct. :) Thank you! Best Regards, WEnlei
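For anyone reading later: Spark does expose location *hints* (preferred locations), though not hard placement guarantees. One way to express the "first 8 partitions on the first machine" idea is `SparkContext.makeRDD`, which accepts `(element, preferredLocations)` pairs and creates one partition per element. A minimal sketch, assuming four hypothetical worker hostnames `node1`..`node4`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PinnedPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pinned-partitions"))

    // Hypothetical worker hostnames -- replace with your cluster's hosts.
    val hosts = Seq("node1", "node2", "node3", "node4")

    // 32 elements -> 32 partitions; partition i is hinted to host i / 8,
    // so partitions 0-7 prefer node1, 8-15 prefer node2, etc.
    val data = (0 until 32).map(i => (i, Seq(hosts(i / 8))))

    val pinned = sc.makeRDD(data)
    println(pinned.getNumPartitions) // 32

    sc.stop()
  }
}
```

Note that these are only scheduling preferences: the scheduler tries to honor them (subject to locality-wait settings such as `spark.locality.wait`), but may run a task elsewhere if the preferred executor is busy. If the imbalance comes from skewed key distribution rather than placement, a custom `Partitioner` on a keyed RDD may be the better fix.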
