The goal of rdd.persist is to create a cached RDD that short-circuits the DAG lineage: computations *in the same job* that use that RDD can reuse the intermediate result instead of recomputing it from the start, but it is not meant to survive between job runs.
For example:

    val baseData = rawDataRdd.map(...).flatMap(...).reduceByKey(...).persist()
    val metric1 = baseData.flatMap(op1).reduceByKey(...).collect()
    val metric2 = baseData.flatMap(op2).reduceByKey(...).collect()

Without persist, computing metric1 and metric2 would each trigger the computation starting from rawDataRdd. With persist, both metric1 and metric2 start from the intermediate result (baseData).

If you need to persist to files ad hoc, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them back later using sparkContext.objectFile(...) [2] (a minimal round-trip sketch follows the quoted message below).

If you want to preserve RDDs in memory between job runs, you should look at the Spark-JobServer [3].

[1] https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
[2] https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.SparkContext
[3] https://github.com/ooyala/spark-jobserver

On Thu, Jun 12, 2014 at 11:24 AM, Toby Douglass <t...@avocet.io> wrote:
> Gents,
>
> I am investigating Spark with a view to performing reporting on a large
> data set, where the large data set receives additional data in the form
> of log files on an hourly basis.
>
> Where the data set is large, there is a possibility we will create a
> range of aggregate tables to reduce the volume of data which has to be
> processed.
>
> Having spent a little while reading up about Spark, my thought was that
> I could create an RDD which is an agg, persist this to disk, have
> reporting queries run against that RDD, and when new data arrives,
> convert the new log file into an agg and add that to the agg RDD.
>
> However, I begin now to get the impression that RDDs cannot be persisted
> across jobs - I can generate an RDD, I can persist it, but I can see no
> way for a later job to load a persisted RDD (and I begin to think it
> will have been GCed anyway, at the end of the first job). Is this
> correct?
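For completeness, here is a minimal sketch of the save/load round trip mentioned above. It assumes a running SparkContext named sc (e.g. in spark-shell), a small stand-in pair RDD as the aggregate, and a placeholder output path; adjust all three for your setup.

    // Stand-in for the aggregated RDD; replace with your real agg.
    // "hdfs:///tmp/agg" is a placeholder path.
    val agg = sc.parallelize(Seq(("a", 1L), ("b", 2L)))

    // First run: write the RDD out as a SequenceFile of serialized objects.
    agg.saveAsObjectFile("hdfs:///tmp/agg")

    // Later run (new job, new SparkContext): load it back. The element type
    // must be supplied explicitly, since the file does not encode it.
    val reloaded = sc.objectFile[(String, Long)]("hdfs:///tmp/agg")
    reloaded.collect().foreach(println)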