The goal of rdd.persist is to create a cached RDD that short-circuits the DAG lineage: computations *in the same job* that use that RDD can reuse the intermediate result instead of recomputing it from the start, but it is not meant to survive between job runs.
For example:

    val baseData = rawDataRdd.map(...).flatMap(...).reduceByKey(...).persist()
    val metric1 = baseData.flatMap(op1).reduceByKey(...).collect()
    val metric2 = baseData.flatMap(op2).reduceByKey(...).collect()

Without persist, computing metric1 and metric2 would each trigger the computation starting from rawDataRdd. With persist, both metric1 and metric2 start from the intermediate result (baseData).

If you need to persist to files ad hoc, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them back later using sparkContext.objectFile(...) [2] (a minimal round-trip sketch follows the quoted message below).

If you want to preserve RDDs in memory between job runs, you should look at the Spark-JobServer [3].

[1] https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
[2] https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.SparkContext
[3] https://github.com/ooyala/spark-jobserver

On Thu, Jun 12, 2014 at 11:24 AM, Toby Douglass <t...@avocet.io> wrote:
> Gents,
>
> I am investigating Spark with a view to performing reporting on a large
> data set, where the large data set receives additional data in the form
> of log files on an hourly basis.
>
> Where the data set is large, there is a possibility we will create a
> range of aggregate tables to reduce the volume of data which has to be
> processed.
>
> Having spent a little while reading up about Spark, my thought was that
> I could create an RDD which is an agg, persist this to disk, have
> reporting queries run against that RDD, and when new data arrives,
> convert the new log file into an agg and add that to the agg RDD.
>
> However, I begin now to get the impression that RDDs cannot be persisted
> across jobs - I can generate an RDD, I can persist it, but I can see no
> way for a later job to load a persisted RDD (and I begin to think it
> will have been GCed anyway, at the end of the first job). Is this
> correct?
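For completeness, here is a minimal sketch of the save/load round trip mentioned above. It assumes a running SparkContext named sc (e.g. in spark-shell), a small stand-in pair RDD as the aggregate, and a placeholder output path; adjust all three for your setup.

    // Stand-in for the aggregated RDD; replace with your real agg.
    // "hdfs:///tmp/agg" is a placeholder path.
    val agg = sc.parallelize(Seq(("a", 1L), ("b", 2L)))

    // First run: write the RDD out as a SequenceFile of serialized objects.
    agg.saveAsObjectFile("hdfs:///tmp/agg")

    // Later run (new job, new SparkContext): load it back. The element type
    // must be supplied explicitly, since the file does not encode it.
    val reloaded = sc.objectFile[(String, Long)]("hdfs:///tmp/agg")
    reloaded.collect().foreach(println)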