A mutable map in an object should do what your looking for then I believe.
You just reference the object as an object in your closure so it won't be
swept up when your closure is serialized and you can reference variables of
the object on the remote host then. e.g.:

object MyObject {
  val mmap = scala.collection.mutable.Map[Long, Long]()
}

rdd.map { ele =>
MyObject.mmap.getOrElseUpdate(ele, 1L)
...
}.map {ele =>
require(MyObject.mmap(ele) == 1L)

}.count

Along with the data loss just be careful with thread safety and multiple
threads/partitions on one host so the map should be viewed as shared
amongst a larger space.



Also with your exact description it sounds like your data should be encoded
into the RDD if its per-record/per-row:  RDD[(MyBaseData,
LastIterationSideValues)]



On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung
<coded...@cs.stanford.edu>wrote:

> In our case, we'd like to keep memory content from one iteration to the
> next, and not just during a single mapPartition call because then we can do
> more efficient computations using the values from the previous iteration.
>
> So essentially, we need to declare objects outside the scope of the
> map/reduce calls (but residing in individual workers), then those can be
> accessed from the map/reduce calls.
>
> We'd be making some assumptions as you said, such as - RDD partition is
> statically located and can't move from worker to another worker unless the
> worker crashes.
>
>
>
> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> Actually, I do not know how to do something like this or whether this is
>>> possible - thus my suggestive statement.
>>>
>>> Can you already declare persistent memory objects per worker? I tried
>>> something like constructing a singleton object within map functions, but
>>> that didn't work as it seemed to actually serialize singletons and pass it
>>> back and forth in a weird manner.
>>>
>>>
>> Does it need to be persistent across operations, or just persist for the
>> lifetime of processing of one partition in one mapPartition? The latter is
>> quite easy and might give most of the speedup.
>>
>> Maybe that's 'enough', even if it means you re-cache values several times
>> in a repeated iterative computation. It would certainly avoid managing a
>> lot of complexity in trying to keep that state alive remotely across
>> operations. I'd also be interested if there is any reliable way to do that,
>> though it seems hard since it means you embed assumptions about where
>> particular data is going to be processed.
>>
>>
>

Reply via email to