There is no need to cache or persist explicitly if 1) the stage that you are concerned with
either made use of or produced MapOutputs/shuffle files; 2) reuse of those
shuffle files (which may very well be in the OS buffer cache of the worker
nodes) is sufficient for your needs; 3) the relevant Stage objects haven't
gone out of scope, which would allow the shuffle files to be removed; 4)
you reuse the exact same Stage objects that were used previously. If all
of that is true, then Spark will reuse the prior stage, with performance
very similar to what you would get from explicitly caching an equivalent RDD.
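
A minimal sketch of both approaches, meant to be pasted into spark-shell. The dataset, the names (counts, distinctKeys, and so on) and the local[2] master are assumptions made up for illustration, not taken from the original job. Holding on to the same RDD reference is one way to try to satisfy conditions 3 and 4 above; whether the earlier stage is actually skipped should be confirmed in the web UI rather than taken for granted.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// In spark-shell, getOrCreate() returns the session the shell already started;
// master("local[2]") only matters if this is run standalone (assumption for the sketch).
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("stage-reuse-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for the expensive early stages of a long job.
val counts = sc.parallelize(1 to 1000000, 8)
  .map(i => (i % 100, 1L))
  .reduceByKey(_ + _) // shuffle boundary: the map side writes shuffle files

// First action: runs the map-side stage and the reduce-side stage.
val distinctKeys = counts.count()

// Second action on the SAME `counts` reference: the shuffle output is still
// registered, so the map-side stage is expected to be skipped, not rerun.
val total = counts.values.reduce(_ + _)

// Explicit alternative: persist the RDD, which also avoids recomputing
// the reduce side on later actions.
counts.persist(StorageLevel.MEMORY_AND_DISK)
val cachedKeys = counts.count()                      // materializes the cache
val evenKeys = counts.filter(_._1 % 2 == 0).count()  // reads the cached partitions
```

If the map-side stage is listed as skipped for the second job in the web UI, the shuffle files were reused. Rebuilding the pipeline from scratch (for example by re-evaluating the val definitions) creates new RDDs with new shuffle dependencies, so nothing from the earlier run is reused unless it was persisted and the persisted RDD is the one being referenced.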
On Mon, Oct 17, 2016 at 4:53 PM, ayan guha <guha.a...@gmail.com> wrote:
> You can use cache or persist.
> On Tue, Oct 18, 2016 at 10:11 AM, Yang <teddyyyy...@gmail.com> wrote:
>> I'm trying out 2.0 and ran a long job with 10 stages in spark-shell.
>> It seems that after all 10 finished successfully, if I run the last, or
>> the 9th, again,
>> Spark reruns all the previous stages from scratch instead of utilizing
>> the partial results.
>> This is quite serious, since I can't experiment while making small changes
>> to the code.
>> Any idea what part of the Spark framework might have caused this?
> Best Regards,
> Ayan Guha