In our case, we'd like to keep memory content from one iteration to the next, not just during a single mapPartitions call, because then we can do more efficient computations using the values from the previous iteration.
So essentially, we need to declare objects outside the scope of the map/reduce calls (but residing in individual workers), so that they can be accessed from within the map/reduce calls. We'd be making some assumptions, as you said, such as: an RDD partition is statically located and can't move from one worker to another unless the worker crashes.

On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:

> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
>
>> Actually, I do not know how to do something like this or whether this is
>> possible - thus my suggestive statement.
>>
>> Can you already declare persistent memory objects per worker? I tried
>> something like constructing a singleton object within map functions, but
>> that didn't work as it seemed to actually serialize singletons and pass
>> them back and forth in a weird manner.
>
> Does it need to be persistent across operations, or just persist for the
> lifetime of processing of one partition in one mapPartition? The latter is
> quite easy and might give most of the speedup.
>
> Maybe that's 'enough', even if it means you re-cache values several times
> in a repeated iterative computation. It would certainly avoid managing a
> lot of complexity in trying to keep that state alive remotely across
> operations. I'd also be interested if there is any reliable way to do that,
> though it seems hard since it means you embed assumptions about where
> particular data is going to be processed.
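For what it's worth, a common way to get "objects outside the scope of the map/reduce calls" on the JVM is static state: statics belong to the worker process rather than to the task closure, so they are never serialized with the function and survive across tasks in the same executor JVM (in Scala, a top-level `object` compiles to exactly this). That would also explain the serialization you saw: a singleton constructed *inside* a map function gets captured by the closure and shipped around. Below is a minimal sketch of the pattern; the names (`WorkerLocalCache`, the dummy payload) are hypothetical, and the `main` merely simulates two tasks landing on the same worker JVM rather than running on Spark.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-worker cache: static fields live in the executor JVM,
// not in the serialized task closure, so state here persists across
// tasks (and iterations) scheduled on the same worker.
public class WorkerLocalCache {
    private static final ConcurrentHashMap<String, double[]> CACHE =
        new ConcurrentHashMap<>();
    static final AtomicInteger computations = new AtomicInteger();

    // Compute-once-per-JVM accessor: the first task pays the cost,
    // later tasks on the same worker reuse the cached value.
    static double[] getOrCompute(String key) {
        return CACHE.computeIfAbsent(key, k -> {
            computations.incrementAndGet();
            return new double[] {1.0, 2.0};  // stand-in for an expensive load
        });
    }

    public static void main(String[] args) {
        // Simulate two tasks running in the same worker JVM: the second
        // call reuses the value cached by the first instead of recomputing.
        double[] first = getOrCompute("model");
        double[] second = getOrCompute("model");
        System.out.println("computed " + computations.get()
            + " time(s), same array: " + (first == second));
    }
}
```

The caveat from the thread still applies: this only helps if the partition is rescheduled onto the same worker across iterations, which Spark does not guarantee, so the cache has to be treated as a best-effort optimization with recomputation as the fallback.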