I guess it could be solved by extending an existing RDD and overriding its getPreferredLocations() definition.
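A minimal sketch of that idea follows. This is only an assumption about how it might look, not a tested solution: `PinnedLocationRDD` and the `hosts` list are hypothetical names, and note that preferred locations are scheduling *hints* that Spark's scheduler is free to ignore under load.

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper RDD that pins each partition to a fixed host.
// `hosts` is an assumed list of worker hostnames, e.g. Seq("host1", "host2", ...).
class PinnedLocationRDD[T: ClassManifest](prev: RDD[T], hosts: Seq[String])
    extends RDD[T](prev) {

  // Reuse the parent's partitioning and computation unchanged.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)

  // Map partition index ranges onto hosts: with 32 partitions and 4 hosts,
  // partitions 0-7 prefer hosts(0), 8-15 prefer hosts(1), and so on.
  override def getPreferredLocations(split: Partition): Seq[String] = {
    val perHost = math.max(1, partitions.length / hosts.size)
    Seq(hosts(math.min(split.index / perHost, hosts.size - 1)))
  }
}
```

The block sizes here mirror the 32-partitions-on-4-machines example from the question below; whether the executors actually run there also depends on locality wait settings and executor availability.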
But I am not sure; I will wait for the answer.

On Thu, Oct 31, 2013 at 10:44 PM, Wenlei Xie <[email protected]> wrote:

> Hi,
>
> My iterative program written in Spark shows quite variable running times
> across iterations, although the computation load is supposed to be
> roughly the same. My program logic adds a batch of tuples and deletes
> roughly the same number of tuples in each iteration.
>
> I suspect part of the reason is that the partitions are not allocated
> evenly between the machines. Is there any easy way to fix the output
> location for each partition? (Say, each time I create a new RDD with 32
> partitions when running on 4 machines, I would like to pin the first 8
> partitions to the first machine, the second 8 partitions to the second
> machine, etc.) I just want to verify whether my assumption is correct. :)
>
> Thank you!
>
> Best Regards,
> Wenlei

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A. 43210
