Then every worker would have to hold the whole RDD in memory, which
has some significant drawbacks. As long as all tasks can be executed
local to their partition, additional copies of the data don't help
locality. And in general you need far fewer than N copies of the data
(N being the number of workers) for that.
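
To put rough numbers on the memory cost, here is a minimal sketch of the arithmetic. The RDD size and worker count are made-up values for illustration, not figures from this thread, and `cached_footprint_gb` is a hypothetical helper, not a Spark API. It only counts full copies; it ignores serialization overhead and uneven block placement.

```python
def cached_footprint_gb(rdd_size_gb, copies, num_workers):
    """Return (cluster-wide GB, average GB per worker) for `copies` full copies of an RDD."""
    total = rdd_size_gb * copies
    return total, total / num_workers

# Hypothetical cluster: a 100 GB RDD cached on 20 workers.
rdd_size_gb, num_workers = 100, 20

# 2x replication, as with the _2 storage levels.
replicated_2x = cached_footprint_gb(rdd_size_gb, 2, num_workers)

# One full copy on every worker, as the proposed MEMORY_ALL would require.
on_every_worker = cached_footprint_gb(rdd_size_gb, num_workers, num_workers)

print(replicated_2x)    # (200, 10.0)
print(on_every_worker)  # (2000, 100.0)
```

Under these assumptions, each worker would need to hold the full 100 GB instead of about 10 GB on average.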
Yes. Effectively, could it avoid network transfers? Or put differently, would
an option like persist(MEMORY_ALL) improve job speed by caching an RDD on every
worker?
On 25.02.2015, at 11:42, Sean Owen wrote:
If you mean, can both copies of the blocks be used for computations?
yes they can.
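
As a toy illustration of why a second usable copy can matter, the sketch below assigns each partition's task to the least-loaded worker that holds a replica. This is not Spark's scheduler; `assign_tasks`, the worker names, and the block placement are all hypothetical and only show that more replicas give more node-local choices.

```python
def assign_tasks(replica_hosts):
    """Greedily place each partition's task on the least-loaded host holding a replica."""
    load = {}
    assignment = {}
    for partition, hosts in replica_hosts.items():
        # Every candidate host has the block locally, so any choice is node-local.
        chosen = min(hosts, key=lambda h: load.get(h, 0))
        assignment[partition] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment, load

# Hypothetical placement: one copy of every block, all on worker w1.
single_copy = {0: ["w1"], 1: ["w1"], 2: ["w1"], 3: ["w1"]}

# Hypothetical placement with 2x replication: a second copy on w2 or w3.
two_copies = {0: ["w1", "w2"], 1: ["w1", "w3"], 2: ["w1", "w2"], 3: ["w1", "w3"]}

_, load_single = assign_tasks(single_copy)
_, load_double = assign_tasks(two_copies)

print(max(load_single.values()))  # 4
print(max(load_double.values()))  # 2
```

With one copy, all four local tasks queue on the same worker; with two copies they spread over three workers while staying local.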
On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier wrote:
> Hi,
>
> just a quick question about calling persist with the _2 option. Is the 2x
> replication only useful for fault tolerance, or will it also increase j