Hi Subash,

I'm not sure how the checkpointing works, but with
StorageLevel.MEMORY_AND_DISK, Spark will store the RDD in on-heap memory,
and spill to disk if necessary. However, the data is only usable by that
Spark job. Saving the RDD will write the data out to an external storage
system, like HDFS or Alluxio
<http://www.alluxio.org/docs/1.8/en/compute/Spark.html?utm_source=spark>.

There are some advantages to saving the RDD, mainly that different jobs,
or even different frameworks, can read that data. One possibility is to
save the RDD to Alluxio, which can store the data in memory, improving
throughput by avoiding the disk. Here is an article discussing different
ways to store RDDs
<http://www.alluxio.com/blog/effective-spark-rdds-with-alluxio?utm_source=spark>
.
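For reference, the three options from the question below might look
something like this minimal Scala sketch (the checkpoint directory and
the Alluxio output path are placeholder values, not recommendations):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("persist-vs-save").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000000).map(i => (i % 100, i))

// 1. Checkpoint: writes the RDD to the checkpoint directory and truncates
//    its lineage. The directory below is an assumed placeholder.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
rdd.checkpoint()

// 2. Persist: keeps the partitions in on-heap memory, spilling to disk
//    when they don't fit. Only this Spark job can use the cached data.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// 3. Save: writes to an external storage system (HDFS, Alluxio, ...),
//    so other jobs or frameworks can read the data back later.
rdd.saveAsObjectFile("alluxio://master:19998/tmp/intermediate")
```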

Thanks,
Gene

On Thu, Apr 18, 2019 at 10:49 AM Subash Prabakar <subashpraba...@gmail.com>
wrote:

> Hi All,
>
> I have a doubt about checkpointing and persist/saving.
>
> Say we have one RDD - containing huge data,
> 1. We checkpoint and perform join
> 2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
> 3. We save that intermediate RDD and perform join (using the same RDD;
> saving is just to persist the intermediate result before joining)
>
>
> Which of the above is faster, and what's the difference?
>
>
> Thanks,
> Subash
>