Thanks, Mark. Since checkpoint may get cleaned up later on, it seems option #2 (saveXXX) is viable.
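
For the archive, here is a rough sketch of the two options as I understand them from this thread. It is only an illustration, not a recommendation from anyone here: the local master, the object name, the sample data and both /tmp paths are made up for the example, and I have used saveAsTextFile to stand in for "saveXXX".

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointVsSave {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("checkpoint-vs-save").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical locations; on a cluster these would be HDFS paths.
    val checkpointDir = "/tmp/spark-checkpoints"
    val outputDir = "/tmp/last-rdd-output"
    sc.setCheckpointDir(checkpointDir)

    // Only transformations so far: nothing has been computed yet,
    // this is just the recipe Mark describes.
    val last = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)

    // Option 1: checkpoint. The call only marks the RDD; it has to come
    // before any job runs on it, and persisting first avoids recomputing
    // the chain when the checkpoint file is written.
    last.cache()
    last.checkpoint()
    println(s"checkpointed before any action: ${last.isCheckpointed}")
    last.count() // first action: runs the chain, then writes the checkpoint
    println(s"checkpointed after an action: ${last.isCheckpointed}")

    // Option 2: save explicitly and read the data back for further work.
    // saveAsTextFile is itself an action; by default it fails if outputDir
    // already exists.
    last.saveAsTextFile(outputDir)
    val reloaded = sc.textFile(outputDir).map(_.toInt)
    println(s"rows read back: ${reloaded.count()}")

    sc.stop()
  }
}
```

With option 2 the output location and its lifetime are under my control, which is why I lean that way.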

On Wed, Mar 23, 2016 at 8:01 PM, Mark Hamstra <[email protected]> wrote:

> Yes, the terminology is being used sloppily/non-standardly in this thread
> -- "the last RDD" after a series of transformation is the RDD at the
> beginning of the chain, just now with an attached chain of "to be done"
> transformations when an action is eventually run. If the saveXXX action is
> the only action being performed on the RDD, the rest of the chain being
> purely transformations, then checkpointing instead of saving still wouldn't
> execute any action on the RDD -- it would just mark the point at which
> checkpointing should be done when an action is eventually run.
>
> On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu <[email protected]> wrote:
>
>> bq. when I get the last RDD
>> If I read Todd's first email correctly, the computation has been done.
>> I could be wrong.
>>
>> On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra <[email protected]>
>> wrote:
>>
>>> Neither of you is making any sense to me. If you just have an RDD for
>>> which you have specified a series of transformations but you haven't run
>>> any actions, then neither checkpointing nor saving makes sense -- you
>>> haven't computed anything yet, you've only written out the recipe for how
>>> the computation should be done when it is needed. Neither does the "called
>>> before any job" comment pose any restriction in this case since no jobs
>>> have yet been executed on the RDD.
>>>
>>> On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> See the doc for checkpoint:
>>>>
>>>> * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
>>>> * directory set with `SparkContext#setCheckpointDir` and all references to its parent
>>>> * RDDs will be removed. *This function must be called before any job has been*
>>>> * *executed on this RDD*. It is strongly recommended that this RDD is persisted in
>>>> * memory, otherwise saving it on a file will require recomputation.
>>>>
>>>> From the above description, you should not call it at the end of
>>>> transformations.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Mar 23, 2016 at 7:14 PM, Todd <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a long computing chain, when I get the last RDD after a series
>>>>> of transformation. I have two choices to do with this last RDD
>>>>>
>>>>> 1. Call checkpoint on RDD to materialize it to disk
>>>>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>>>>> processing
>>>>>
>>>>> I would ask which choice is better? It looks to me that is not much
>>>>> difference between the two choices.
>>>>> Thanks!
