Hi Sung,

On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:
> The goal is to keep an intermediate value per row in memory, which would
> allow faster subsequent computations. I.e., computeSomething would depend
> on the previous value from the previous computation.
I think the fundamental problem here is that there is no "in-memory state" of the sort you mention when you're talking about map/reduce-style workloads. There are three kinds of data that you can use to communicate between sub-tasks:

- RDD input / output, i.e. the arguments and return values of the closures you pass to transformations
- Broadcast variables
- Accumulators

In general, distributed algorithms should strive to be stateless, exactly because of issues like reliability and having to re-run computations (and because communication and coordination are expensive in general). The last two items above are not targeted at the kind of state-keeping you seem to be describing.

So if you make the result of computeSomething() the output of your map task, you'll have access to it in the operations downstream from that map task; see the sketch at the end of this message. But you can't "store it in a variable in memory" and access it later, because that's not how the system works.

In any case, I'm really not familiar with ML algorithms, but maybe you should take a look at MLlib.
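A minimal sketch of the "carry the value in the RDD" approach (computeSomething() here is just a placeholder for whatever your per-row function does, and the Double rows are only for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object StatefulSketch {
  // Stand-in for your computeSomething(); the real logic is yours.
  def computeSomething(x: Double): Double = x * x

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[*]"))

    val rows = sc.parallelize(Seq(1.0, 2.0, 3.0))

    // Carry the intermediate value in the RDD itself instead of in an
    // executor-local variable; cache() keeps the pairs in memory so
    // downstream stages can reuse them without recomputing.
    val withIntermediate = rows.map(x => (x, computeSomething(x))).cache()

    // The next computation reads the pair as its input; no hidden
    // per-row state is needed.
    val next = withIntermediate.map { case (x, v) => v + x }
    println(next.collect().mkString(", "))

    sc.stop()
  }
}

-- 
Marcelo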