Actually, I do not know how to do something like this, or whether it is even possible - hence my tentative suggestion.
Can you already declare persistent memory objects per worker? I tried
something like constructing a singleton object within map functions, but
that didn't work as it seemed to actually serialize singletons and pass
them back and forth in a weird manner.

On Mon, Apr 28, 2014 at 1:23 AM, Sean Owen <so...@cloudera.com> wrote:

> On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung <coded...@cs.stanford.edu>
> wrote:
>>
>> e.g. something like
>>
>> rdd.mapPartition((rows : Iterator[String]) => {
>>   var idx = 0
>>   rows.map((row: String) => {
>>     val valueMap = SparkWorker.getMemoryContent("valMap")
>>     val prevVal = valueMap(idx)
>>     idx += 1
>>     ...
>>   })
>>   ...
>> })
>>
>> The developer can implement their own fault recovery mechanism if the
>> worker has crashed and lost the memory content.
>>
>
> Yea you can always just declare your own per-partition data structures in
> a function block like that, right? valueMap can be initialized to an empty
> map, loaded from somewhere, or even a value that is broadcast from the
> driver.
>
> That's certainly better than tacking data onto RDDs.
>
> It's not restored if the computation is lost of course, but in this and
> many other cases, it's fine, as it is just for some cached intermediate
> results.
>
> This already works then or did I misunderstand the original use case?
>
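
For what it's worth, here is a minimal sketch of the pattern Sean describes: per-partition mutable state declared inside a mapPartitions block, seeded from a value broadcast by the driver. The names (PerPartitionStateExample, initialValues, valMapBroadcast) and the toy data/update logic are my own assumptions, not anything from the thread; it only illustrates that the state lives for the duration of one task and is rebuilt if the task is re-run.

  import org.apache.spark.{SparkConf, SparkContext}

  object PerPartitionStateExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("per-partition-state"))

      // Hypothetical initial values, shipped to every executor once via a
      // broadcast variable rather than re-serialized inside each task closure.
      val initialValues: Map[Int, Double] = Map(0 -> 1.0, 1 -> 2.0)
      val valMapBroadcast = sc.broadcast(initialValues)

      val rdd = sc.parallelize(Seq("a", "bb", "ccc"), numSlices = 2)

      val result = rdd.mapPartitions { rows =>
        // Per-partition state: created once per partition, lives only for the
        // duration of this task, and is lost (then rebuilt) if the task is re-run.
        val valueMap = collection.mutable.Map[Int, Double]() ++= valMapBroadcast.value
        var idx = 0
        rows.map { row =>
          val prevVal = valueMap.getOrElse(idx, 0.0)
          valueMap(idx) = prevVal + row.length   // arbitrary toy update
          idx += 1
          (row, prevVal)
        }
      }

      result.collect().foreach(println)
      sc.stop()
    }
  }

This keeps the intermediate map out of the RDD itself, but as noted above it does not survive a lost computation; the map is simply reinitialized from the broadcast value when the partition is recomputed.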