Good morning,

I have a large-scale job that breaks for certain input sizes, so I am
trying to use checkpointing to split the DAG and find the
problematic point. I have some questions about checkpointing:

   1. What is the utility of non-eager checkpointing?
   2. How is checkpointing different from manually writing a DataFrame (or
   RDD) to HDFS? Writing it manually also lets me re-read the stored
   DataFrame, while with checkpointing I don't see a simple way of
   re-reading the data in a future job.
   3. I read that checkpointing differs from persisting because the
   lineage is not stored, but I don't understand why persisting keeps the
   lineage. The point of persisting is that the next computation starts from
   the persisted data (either memory or memory and disk), so what is the
   advantage of having the lineage available? Am I missing some basic
   understanding of these two apparently different operations?

Thanks,
*Alessandro Liparoti*
