Re: What's the benefit of RDD checkpoint against RDD save

Ted Yu Thu, 24 Mar 2016 14:34:03 -0700

Thanks, Mark.

Since checkpoint may get cleaned up later on, it seems option #2 (saveXXX)
is viable.
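
Mark's point quoted below, that checkpoint() only marks an RDD and nothing is computed until an action runs, can be illustrated with a toy model. This is a minimal sketch in plain Python and not Spark's API: ToyRDD and all of its methods are hypothetical names, and it ignores details such as Spark recomputing an unpersisted RDD when it writes the checkpoint.

```python
# Toy model of lazy evaluation. NOT Spark's API; every name here is hypothetical.

class ToyRDD:
    """A recipe for a computation that runs only when an action is called."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function: the recipe
        self._marked = False      # set by checkpoint(); computes nothing
        self._saved = None        # filled in by the first action, if marked

    def map(self, fn):
        # A transformation: returns a new recipe that refers to its parent.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def checkpoint(self):
        # Only marks this RDD; materialization waits for an action.
        self._marked = True

    def is_checkpointed(self):
        return self._saved is not None

    def collect(self):
        # An action: the first point at which anything is computed.
        if self._saved is not None:
            return list(self._saved)   # reuse the materialized data
        data = list(self._compute())
        if self._marked:
            self._saved = list(data)
            self._compute = None       # drop the reference to the parent recipe
        return data


def load():
    print("loading input")             # shows when the recipe actually runs
    return [1, 2, 3]

rdd = ToyRDD(load).map(lambda x: x * 10)
rdd.checkpoint()                       # nothing has been loaded at this point
result = rdd.collect()                 # the recipe runs here, then is saved
again = rdd.collect()                  # served from the saved copy
```

In this model the checkpoint() call has no visible effect until collect() runs, which is why calling it in place of saveXXX would not by itself materialize anything.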


On Wed, Mar 23, 2016 at 8:01 PM, Mark Hamstra <[email protected]>
wrote:

> Yes, the terminology is being used sloppily/non-standardly in this thread
> -- "the last RDD" after a series of transformations is the RDD at the
> beginning of the chain, just now with an attached chain of "to be done"
> transformations when an action is eventually run.  If the saveXXX action is
> the only action being performed on the RDD, the rest of the chain being
> purely transformations, then checkpointing instead of saving still wouldn't
> execute any action on the RDD -- it would just mark the point at which
> checkpointing should be done when an action is eventually run.
>
> On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu <[email protected]> wrote:
>
>> bq. when I get the last RDD
>> If I read Todd's first email correctly, the computation has been done.
>> I could be wrong.
>>
>> On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra <[email protected]>
>> wrote:
>>
>>> Neither of you is making any sense to me.  If you just have an RDD for
>>> which you have specified a series of transformations but you haven't run
>>> any actions, then neither checkpointing nor saving makes sense -- you
>>> haven't computed anything yet, you've only written out the recipe for how
>>> the computation should be done when it is needed.  Neither does the "called
>>> before any job" comment pose any restriction in this case since no jobs
>>> have yet been executed on the RDD.
>>>
>>> On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> See the doc for checkpoint:
>>>>
>>>>    * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
>>>>    * directory set with `SparkContext#setCheckpointDir` and all references to its parent
>>>>    * RDDs will be removed. *This function must be called before any job has been
>>>>    * executed on this RDD*. It is strongly recommended that this RDD is persisted in
>>>>    * memory, otherwise saving it on a file will require recomputation.
>>>>
>>>> From the above description, you should not call it at the end of
>>>> transformations.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Mar 23, 2016 at 7:14 PM, Todd <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a long computation chain. When I get the last RDD after a series
>>>>> of transformations, I have two choices for what to do with it:
>>>>>
>>>>> 1. Call checkpoint on RDD to materialize it to disk
>>>>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>>>>> processing
>>>>>
>>>>> Which choice is better? It looks to me like there is not much
>>>>> difference between the two.
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
