Okay, maybe I am confused by the wording "would be useful to *restart* from the
output of stage 0" ... did the OP mean a restart by the user, or an automatic
restart by the system?

On Tue, Jul 28, 2015 at 3:43 PM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> I do not think the OP is asking about attempt failure, but about stage
> failure that ultimately leads to job failure. In that case, the RDD info
> from the last run is gone, even if it was cached, isn't it?
>
> Ayan
> On 29 Jul 2015 07:01, "Tathagata Das" <t...@databricks.com> wrote:
>
>> If you are using the same RDDs in both attempts to run the job, the stage
>> outputs generated in the previous job will indeed be reused.
>> This applies to core, though. For DataFrames, depending on what you do,
>> the physical plan may get generated again, leading to new RDDs, which may
>> cause all the stages to be recomputed. Consider generating the RDD from
>> the DataFrame first and then running the job using that RDD.
>>
>> Of course, you can use caching in both core and DataFrames, which will
>> solve all these concerns.
>>
>> On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky <
>> alex.nastet...@vervemobile.com> wrote:
>>
>>> Is it possible to restart the job from the last successful stage instead
>>> of from the beginning?
>>>
>>> For example, if your job has stages 0, 1, and 2, and stage 0 takes a long
>>> time and is successful, but the job fails on stage 1, it would be useful
>>> to be able to restart from the output of stage 0 instead of from the
>>> beginning.
>>>
>>> Note that I am NOT talking about Spark Streaming, just Spark Core (and
>>> DataFrames); I am not sure whether the case would be different with
>>> Streaming.
>>>
>>> Thanks.
>>>
>>
>>
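
For what it's worth, here is a rough sketch (Scala, assuming Spark 1.4+ APIs;
the input path "events.json", the column name "id", and the object name are
made up for illustration) of the two suggestions in Tathagata's reply: drop
from the DataFrame to an RDD once so the physical plan is not regenerated on
a re-run, and cache that RDD so the expensive earlier stages are not
recomputed if a later stage fails and the job is resubmitted.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

object StageReuseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-reuse-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical expensive input (the "stage 0" work in the example).
    val df = sqlContext.read.json("events.json")

    // Drop to an RDD once so re-running does not re-plan the DataFrame into
    // new RDDs, and cache it so completed stages are not recomputed.
    val rows = df.rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rows.count()  // materialize the cache before the later, failure-prone work

    // Later stages reuse `rows`; if the job fails past this point and is
    // resubmitted against this same RDD, the cached partitions are reused.
    val result = rows.map(_.getAs[String]("id")).distinct().count()
    println(result)

    sc.stop()
  }
}

Note this only helps if the failed job is resubmitted within the same
SparkContext/application; cached blocks do not survive the application
itself going down, which I think is the case Ayan is describing.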
