Hi Subash,

I'm not sure how checkpointing works, but with StorageLevel.MEMORY_AND_DISK, Spark stores the RDD in on-heap memory and spills to disk if necessary. However, that data is only usable by the same Spark job. Saving the RDD writes the data out to an external storage system, such as HDFS or Alluxio <http://www.alluxio.org/docs/1.8/en/compute/Spark.html?utm_source=spark>.
The main advantage of saving the RDD is that different jobs, or even different frameworks, can read that data. One possibility is to save the RDD to Alluxio, which can keep the data in memory, improving throughput by avoiding the disk. Here is an article discussing different ways to store RDDs <http://www.alluxio.com/blog/effective-spark-rdds-with-alluxio?utm_source=spark>.

Thanks,
Gene

On Thu, Apr 18, 2019 at 10:49 AM Subash Prabakar <subashpraba...@gmail.com> wrote:

> Hi All,
>
> I have a doubt about checkpointing and persist/saving.
>
> Say we have one RDD - containing huge data,
> 1. We checkpoint and perform join
> 2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
> 3. We save that intermediate RDD and perform join (using same RDD - saving
> is to just persist intermediate result before joining)
>
> Which of the above is faster and what's the difference?
>
> Thanks,
> Subash