bq. when I get the last RDD

If I read Todd's first email correctly, the computation has been done.
I could be wrong.
On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra <[email protected]> wrote:

> Neither of you is making any sense to me. If you just have an RDD for
> which you have specified a series of transformations but you haven't run
> any actions, then neither checkpointing nor saving makes sense -- you
> haven't computed anything yet, you've only written out the recipe for how
> the computation should be done when it is needed. Neither does the "called
> before any job" comment pose any restriction in this case, since no jobs
> have yet been executed on the RDD.
>
> On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <[email protected]> wrote:
>
>> See the doc for checkpoint:
>>
>>   * Mark this RDD for checkpointing. It will be saved to a file inside
>>   * the checkpoint directory set with `SparkContext#setCheckpointDir`,
>>   * and all references to its parent RDDs will be removed. This function
>>   * must be called before any job has been executed on this RDD. It is
>>   * strongly recommended that this RDD is persisted in memory, otherwise
>>   * saving it on a file will require recomputation.
>>
>> From the above description, you should not call it at the end of the
>> transformations.
>>
>> Cheers
>>
>> On Wed, Mar 23, 2016 at 7:14 PM, Todd <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a long computing chain, and when I get the last RDD after a
>>> series of transformations, I have two choices for what to do with it:
>>>
>>> 1. Call checkpoint on the RDD to materialize it to disk
>>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>>>    processing
>>>
>>> I would like to ask which choice is better? It looks to me that there
>>> is not much difference between the two.
>>>
>>> Thanks!
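[Editor's note] For readers finding this thread later, the two options Todd describes can be sketched as below. This is a minimal sketch, not code from the thread: the HDFS paths and the word-count chain are placeholders, and `saveAsObjectFile`/`objectFile` stand in for the `RDD.saveXXX` family Todd mentions. It also illustrates Ted's point: `checkpoint()` is only a marker and must be called before the first action, and persisting first avoids recomputing the chain when the checkpoint is written.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointVsSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-vs-save").setMaster("local[*]"))

    // A long transformation chain; nothing has been computed yet --
    // this is only the "recipe" Mark refers to.
    val last = sc.textFile("hdfs:///input/data")      // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Option 1: checkpoint. Mark the RDD BEFORE any job runs on it;
    // persist first so the checkpoint job doesn't recompute the chain.
    sc.setCheckpointDir("hdfs:///checkpoints")        // placeholder path
    last.persist()
    last.checkpoint()
    last.count()   // first action: materializes the RDD and writes the checkpoint

    // Option 2: save and re-read. Saving runs a job immediately; further
    // processing then starts from a fresh RDD with no lineage behind it.
    last.saveAsObjectFile("hdfs:///out/last")         // placeholder path
    val reloaded = sc.objectFile[(String, Int)]("hdfs:///out/last")
    reloaded.take(10)

    sc.stop()
  }
}
```

Either way the data ends up on stable storage and the lineage is cut; the main practical difference is that checkpointing keeps you inside one RDD variable and one job graph, while save-and-reload gives you an explicit, independently readable dataset on HDFS.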
