Okay, maybe I am confused by the phrase "would be useful to *restart* from the output of stage 0" ... did the OP mean a restart by the user or an automatic restart by the system?
On Tue, Jul 28, 2015 at 3:43 PM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> I do not think op asks about attempt failure but stage failure and finally
> leading to job failure. In that case, rdd info from last run is gone even
> if from cache, isn't it?
>
> Ayan
> On 29 Jul 2015 07:01, "Tathagata Das" <t...@databricks.com> wrote:
>
>> If you are using the same RDDs in the both the attempts to run the job,
>> the previous stage outputs generated in the previous job will indeed be
>> reused.
>> This applies to core though. For dataframes, depending on what you do,
>> the physical plan may get generated again leading to new RDDs which may
>> cause recomputing all the stages. Consider running the job by generating
>> the RDD from Dataframe and then using that.
>>
>> Of course, you can use caching in both core and DataFrames, which will
>> solve all these concerns.
>>
>> On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky <
>> alex.nastet...@vervemobile.com> wrote:
>>
>>> Is it possible to restart the job from the last successful stage instead
>>> of from the beginning?
>>>
>>> For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a
>>> long time and is successful, but the job fails on stage 1, it would be
>>> useful to be able to restart from the output of stage 0 instead of from the
>>> beginning.
>>>
>>> Note that I am NOT talking about Spark Streaming, just Spark Core (and
>>> DataFrames), not sure if the case would be different with Streaming.
>>>
>>> Thanks.
>>>
>>
>>
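
For what it's worth, here is a rough PySpark sketch of the "generate the RDD from the DataFrame once, cache it, and re-run the later stages against that same RDD" idea quoted above. The data, the column names and the helpers `build_stage0`, `later_stages` and `run_with_retry` are all made up for illustration; nothing here is a built-in Spark restart feature. It also assumes the retry happens inside the same running application, since cached partitions go away once the SparkContext is stopped.

```python
# Sketch only: the data, column names and helper names are hypothetical.

def build_stage0(spark, n=1000):
    """Stand-in for the slow stage 0: an aggregation that forces a shuffle."""
    df = spark.range(n).selectExpr("id % 10 AS key", "id AS value")
    agg = df.groupBy("key").sum("value")
    # Take the RDD from the DataFrame once and cache that object, so every
    # later attempt works from the same RDD instead of a re-planned one.
    rdd = agg.rdd.cache()
    rdd.count()  # run an action so the cached partitions are materialized
    return rdd


def later_stages(stage0_rdd):
    """Stand-in for stages 1 and 2, which consume the stage 0 output."""
    return sorted(stage0_rdd.map(lambda row: (row[0], row[1] * 2)).collect())


def run_with_retry(action, attempts=2):
    """Call action() up to `attempts` times and re-raise the last failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as err:  # a failed job surfaces as an exception
            last_error = err
    raise last_error


if __name__ == "__main__":
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("reuse-stage0")
             .getOrCreate())
    stage0 = build_stage0(spark)
    # Only the later stages are retried; stage 0 is read back from the cache.
    print(run_with_retry(lambda: later_stages(stage0)))
    spark.stop()
```

If the restart has to survive the application itself dying, this does not help, which I think is the point Ayan is raising.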