Re: restart from last successful stage

2015-07-29 Thread Alex Nastetsky
I meant a restart by the user, as ayan said.

I was thinking of a case where, e.g., a Spark conf setting was wrong and the
job failed in Stage 1 (in my example), and we want to rerun the job with the
right conf without rerunning Stage 0. Having this restart capability could
cause some chaos if the conf change would have altered how Stage 0 runs,
possibly creating partition incompatibilities or something else.

Also, another option is to just persist the data from Stage 0 (e.g., with
rdd.saveAsTextFile or saveAsObjectFile) and then run a modified version of
the job that skips Stage 0, assuming you have a full understanding of the
breakdown of stages in your job. A rough sketch of that approach is below.
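
A minimal sketch in Scala against the RDD API; the paths and the exact
Stage 0 computation are hypothetical placeholders (the shuffle at
reduceByKey stands in for the Stage 0 / Stage 1 boundary):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("restart-sketch"))

    // First run: compute the expensive Stage 0 result and save it explicitly.
    val stage0 = sc.textFile("hdfs:///input")
      .map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)                  // shuffle boundary: end of Stage 0
    stage0.saveAsObjectFile("hdfs:///tmp/stage0-output")

    // Modified rerun: load the saved Stage 0 output instead of recomputing
    // it, then do only the former Stage 1 work.
    val resumed = sc.objectFile[(String, Int)]("hdfs:///tmp/stage0-output")
    resumed.mapValues(_ * 2)
      .saveAsTextFile("hdfs:///output")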

On Tue, Jul 28, 2015 at 9:28 PM, Tathagata Das t...@databricks.com wrote:

 Okay, maybe I am confused by the phrase "would be useful to *restart* from
 the output of stage 0" ... did the OP mean a restart by the user, or a
 restart done automatically by the system?

 On Tue, Jul 28, 2015 at 3:43 PM, ayan guha guha.a...@gmail.com wrote:

 Hi

 I do not think the OP is asking about attempt failure, but about stage
 failure finally leading to job failure. In that case, the RDD info from the
 last run is gone, even from the cache, isn't it?

 Ayan
 On 29 Jul 2015 07:01, Tathagata Das t...@databricks.com wrote:

 If you are using the same RDDs in both attempts to run the job, the
 previous stage outputs generated in the previous job will indeed be
 reused.
 This applies to core, though. For DataFrames, depending on what you do,
 the physical plan may get generated again, leading to new RDDs, which may
 cause all the stages to be recomputed. Consider running the job by
 generating the RDD from the DataFrame and then using that RDD.

 Of course, you can use caching in both core and DataFrames, which will
 solve all these concerns.

 On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky 
 alex.nastet...@vervemobile.com wrote:

 Is it possible to restart the job from the last successful stage
 instead of from the beginning?

 For example, if your job has stages 0, 1 and 2, and stage 0 takes a
 long time and is successful, but the job fails on stage 1, it would be
 useful to be able to restart from the output of stage 0 instead of from the
 beginning.

 Note that I am NOT talking about Spark Streaming, just Spark Core (and
 DataFrames); I'm not sure if the case would be different with Streaming.

 Thanks.






Re: restart from last successful stage

2015-07-29 Thread Tathagata Das
If you are changing the SparkConf, that means you have to recreate the
SparkContext, doesn't it? So you have to stop the previous SparkContext,
which deletes all the information about the stages that have been run. So
the better approach is indeed to save the data of the last stage explicitly
and then try rerunning with the updated conf/context.
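
A minimal sketch of that flow; the conf key, paths, and transformations are
hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // First attempt: save the last successful stage's output explicitly.
    val sc1 = new SparkContext(new SparkConf().setAppName("first-attempt"))
    sc1.textFile("hdfs:///input")
      .map(_.toUpperCase)
      .saveAsTextFile("hdfs:///tmp/stage0-output")
    sc1.stop()  // drops all stage bookkeeping held by this context

    // Second attempt: a new context with the corrected conf, resuming from
    // the saved data rather than from the beginning.
    val fixedConf = new SparkConf()
      .setAppName("second-attempt")
      .set("spark.executor.memory", "4g")  // hypothetical corrected setting
    val sc2 = new SparkContext(fixedConf)
    sc2.textFile("hdfs:///tmp/stage0-output")
      .map(_.length)
      .saveAsTextFile("hdfs:///output")
    sc2.stop()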

On Wed, Jul 29, 2015 at 8:28 AM, Alex Nastetsky 
alex.nastet...@vervemobile.com wrote:

 I meant a restart by the user, as ayan said.

 I was thinking of a case where, e.g., a Spark conf setting was wrong and the
 job failed in Stage 1 (in my example), and we want to rerun the job with the
 right conf without rerunning Stage 0. Having this restart capability could
 cause some chaos if the conf change would have altered how Stage 0 runs,
 possibly creating partition incompatibilities or something else.

 Also, another option is to just persist the data from Stage 0 (e.g., with
 rdd.saveAsTextFile or saveAsObjectFile) and then run a modified version of
 the job that skips Stage 0, assuming you have a full understanding of the
 breakdown of stages in your job.

 On Tue, Jul 28, 2015 at 9:28 PM, Tathagata Das t...@databricks.com
 wrote:

 Okay, maybe I am confused by the phrase "would be useful to *restart* from
 the output of stage 0" ... did the OP mean a restart by the user, or a
 restart done automatically by the system?

 On Tue, Jul 28, 2015 at 3:43 PM, ayan guha guha.a...@gmail.com wrote:

 Hi

 I do not think the OP is asking about attempt failure, but about stage
 failure finally leading to job failure. In that case, the RDD info from the
 last run is gone, even from the cache, isn't it?

 Ayan
 On 29 Jul 2015 07:01, Tathagata Das t...@databricks.com wrote:

 If you are using the same RDDs in both attempts to run the job, the
 previous stage outputs generated in the previous job will indeed be
 reused.
 This applies to core, though. For DataFrames, depending on what you do,
 the physical plan may get generated again, leading to new RDDs, which may
 cause all the stages to be recomputed. Consider running the job by
 generating the RDD from the DataFrame and then using that RDD.

 Of course, you can use caching in both core and DataFrames, which will
 solve all these concerns.

 On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky 
 alex.nastet...@vervemobile.com wrote:

 Is it possible to restart the job from the last successful stage
 instead of from the beginning?

 For example, if your job has stages 0, 1 and 2, and stage 0 takes a
 long time and is successful, but the job fails on stage 1, it would be
 useful to be able to restart from the output of stage 0 instead of from the
 beginning.

 Note that I am NOT talking about Spark Streaming, just Spark Core (and
 DataFrames); I'm not sure if the case would be different with Streaming.

 Thanks.







Re: restart from last successful stage

2015-07-28 Thread ayan guha
Hi

I do not think the OP is asking about attempt failure, but about stage
failure finally leading to job failure. In that case, the RDD info from the
last run is gone, even from the cache, isn't it?

Ayan
On 29 Jul 2015 07:01, Tathagata Das t...@databricks.com wrote:

 If you are using the same RDDs in both attempts to run the job, the
 previous stage outputs generated in the previous job will indeed be
 reused.
 This applies to core, though. For DataFrames, depending on what you do,
 the physical plan may get generated again, leading to new RDDs, which may
 cause all the stages to be recomputed. Consider running the job by
 generating the RDD from the DataFrame and then using that RDD.

 Of course, you can use caching in both core and DataFrames, which will
 solve all these concerns.

 On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky 
 alex.nastet...@vervemobile.com wrote:

 Is it possible to restart the job from the last successful stage instead
 of from the beginning?

 For example, if your job has stages 0, 1 and 2, and stage 0 takes a
 long time and is successful, but the job fails on stage 1, it would be
 useful to be able to restart from the output of stage 0 instead of from the
 beginning.

 Note that I am NOT talking about Spark Streaming, just Spark Core (and
 DataFrames); I'm not sure if the case would be different with Streaming.

 Thanks.





Re: restart from last successful stage

2015-07-28 Thread Tathagata Das
If you are using the same RDDs in both attempts to run the job, the
previous stage outputs generated in the previous job will indeed be reused.
This applies to core, though. For DataFrames, depending on what you do, the
physical plan may get generated again, leading to new RDDs, which may cause
all the stages to be recomputed. Consider running the job by generating the
RDD from the DataFrame and then using that RDD.

Of course, you can use caching in both core and DataFrames, which will
solve all these concerns.
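
For example, a minimal sketch against the Spark 1.4-era DataFrame API, with
a hypothetical input path and schema:

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext named sc.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("hdfs:///events.json")

    // Materialize the DataFrame as one concrete RDD and cache it, so later
    // jobs reuse that lineage instead of a freshly generated physical plan.
    val rows = df.rdd.cache()

    rows.count()                                             // computes and caches
    rows.filter(_.getAs[String]("type") == "click").count()  // reuses the cache

Caching the DataFrame itself (df.cache()) also works; going through a single
df.rdd just pins down one concrete RDD lineage.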

On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky 
alex.nastet...@vervemobile.com wrote:

 Is it possible to restart the job from the last successful stage instead
 of from the beginning?

 For example, if your job has stages 0, 1 and 2, and stage 0 takes a long
 time and is successful, but the job fails on stage 1, it would be useful to
 be able to restart from the output of stage 0 instead of from the beginning.

 Note that I am NOT talking about Spark Streaming, just Spark Core (and
 DataFrames); I'm not sure if the case would be different with Streaming.

 Thanks.