Hi Sung,

On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:
> The goal is to keep an intermediate value per row in memory, which would
> allow faster subsequent computations. I.e., computeSomething would depend
> on the previous value from the previous computation.
I think the fundamental problem here is that there is no "in-memory state" of the sort you mention when you're talking about map/reduce-style workloads. There are three kinds of data that you can use to communicate between sub-tasks:

- RDD input / output, i.e. the arguments and return values of the closures you pass to transformations
- Broadcast variables
- Accumulators

In general, distributed algorithms should strive to be stateless, exactly because of issues like reliability and having to re-run computations (and because communication and coordination are expensive in general). The last two items above are not targeted at the kind of state-keeping you seem to be describing.

So if you make the result of computeSomething() the output of your map task, you'll have access to it in the operations downstream from that map task; see the sketch at the end of this message. But you can't "store it in a variable in memory" and access it later, because that's not how the system works.

In any case, I'm really not familiar with ML algorithms, but maybe you should take a look at MLlib.
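A minimal sketch of the "carry the value in the RDD" approach (computeSomething() here is just a placeholder for whatever your per-row function does, and the Double rows are only for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object StatefulSketch {
  // Stand-in for your computeSomething(); the real logic is yours.
  def computeSomething(x: Double): Double = x * x

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[*]"))

    val rows = sc.parallelize(Seq(1.0, 2.0, 3.0))

    // Carry the intermediate value in the RDD itself instead of in an
    // executor-local variable; cache() keeps the pairs in memory so
    // downstream stages can reuse them without recomputing.
    val withIntermediate = rows.map(x => (x, computeSomething(x))).cache()

    // The next computation reads the pair as its input; no hidden
    // per-row state is needed.
    val next = withIntermediate.map { case (x, v) => v + x }
    println(next.collect().mkString(", "))

    sc.stop()
  }
}

-- 
Marcelo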