bq. when I get the last RDD

If I read Todd's first email correctly, the computation has been done.
I could be wrong.
On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra <[email protected]> wrote:

> Neither of you is making any sense to me. If you just have an RDD for
> which you have specified a series of transformations but you haven't run
> any actions, then neither checkpointing nor saving makes sense -- you
> haven't computed anything yet, you've only written out the recipe for how
> the computation should be done when it is needed. Neither does the "called
> before any job" comment pose any restriction in this case, since no jobs
> have yet been executed on the RDD.
>
> On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <[email protected]> wrote:
>
>> See the doc for checkpoint:
>>
>>   * Mark this RDD for checkpointing. It will be saved to a file inside
>>   * the checkpoint directory set with `SparkContext#setCheckpointDir`,
>>   * and all references to its parent RDDs will be removed. This function
>>   * must be called before any job has been executed on this RDD. It is
>>   * strongly recommended that this RDD is persisted in memory, otherwise
>>   * saving it on a file will require recomputation.
>>
>> From the above description, you should not call it at the end of the
>> transformations.
>>
>> Cheers
>>
>> On Wed, Mar 23, 2016 at 7:14 PM, Todd <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a long computing chain, and when I get the last RDD after a
>>> series of transformations, I have two choices for what to do with it:
>>>
>>> 1. Call checkpoint on the RDD to materialize it to disk
>>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>>>    processing
>>>
>>> I would like to ask which choice is better? It looks to me that there
>>> is not much difference between the two.
>>>
>>> Thanks!
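[Editor's note] For readers finding this thread later, the two options Todd describes can be sketched as below. This is a minimal sketch, not code from the thread: the HDFS paths and the word-count chain are placeholders, and `saveAsObjectFile`/`objectFile` stand in for the `RDD.saveXXX` family Todd mentions. It also illustrates Ted's point: `checkpoint()` is only a marker and must be called before the first action, and persisting first avoids recomputing the chain when the checkpoint is written.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointVsSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-vs-save").setMaster("local[*]"))

    // A long transformation chain; nothing has been computed yet --
    // this is only the "recipe" Mark refers to.
    val last = sc.textFile("hdfs:///input/data")      // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Option 1: checkpoint. Mark the RDD BEFORE any job runs on it;
    // persist first so the checkpoint job doesn't recompute the chain.
    sc.setCheckpointDir("hdfs:///checkpoints")        // placeholder path
    last.persist()
    last.checkpoint()
    last.count()   // first action: materializes the RDD and writes the checkpoint

    // Option 2: save and re-read. Saving runs a job immediately; further
    // processing then starts from a fresh RDD with no lineage behind it.
    last.saveAsObjectFile("hdfs:///out/last")         // placeholder path
    val reloaded = sc.objectFile[(String, Int)]("hdfs:///out/last")
    reloaded.take(10)

    sc.stop()
  }
}
```

Either way the data ends up on stable storage and the lineage is cut; the main practical difference is that checkpointing keeps you inside one RDD variable and one job graph, while save-and-reload gives you an explicit, independently readable dataset on HDFS.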
