Hello Adrian,

A1: There is a significant difference between cache and checkpoint. Cache materializes the RDD and keeps it in memory and/or on disk. But the lineage of the RDD (that is, the sequence of operations that generated the RDD) is remembered, so that if there are node failures and parts of the cached RDD are lost, they can be regenerated. Checkpoint, however, saves the RDD to an HDFS file AND actually FORGETS the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault tolerant through replication).
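The difference can be sketched like this (a minimal Scala sketch assuming a local SparkContext; the checkpoint directory and the toy transformations are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("cache-vs-checkpoint").setMaster("local[*]"))

// cache: materialized in memory and/or disk, but lineage is kept,
// so lost partitions can be recomputed from the original operations
val cached = sc.parallelize(1 to 1000).map(_ * 2).cache()
cached.count()  // first action materializes the cached RDD

// checkpoint: written to reliable storage, lineage forgotten
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // hypothetical path
val chk = sc.parallelize(1 to 1000).map(_ * 2)
chk.checkpoint()  // mark for checkpointing (must be called before the action)
chk.count()       // the action triggers the actual checkpoint write
// after this, chk's lineage is truncated; its data is read back from HDFS
```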
A2: The answer to this is a little involved. There are actually two kinds of checkpointing going on: (i) data checkpointing (i.e. RDD checkpoints) and (ii) metadata checkpointing, which stores, among other things, the RDD checkpoint file names associated with DStreams to a file in HDFS.

The metadata checkpoint is done every batch. The data checkpointing, that is, the frequency at which the RDDs of a DStream are checkpointed, is configurable with DStream.checkpoint(...duration...). The duration must be a multiple of the batch interval (or of the slide interval for windowed streams). However, by default, the checkpointing interval is MAX(batch interval, 10 seconds). So if the batch interval is more than 10 seconds, checkpointing happens every batch; if it is less than 10 seconds, it happens every 10 seconds (more precisely, at the smallest multiple of the batch interval that is at least 10 seconds).

TD

On Fri, Feb 14, 2014 at 8:47 AM, Adrian Mocanu <amoc...@verticalscope.com> wrote:

> Hi
>
> Q1:
> Is .checkpoint() the same as .cache() but the difference is that it writes
> to disk instead of memory?
>
> Q2:
> When is the checkpointed data rewritten in a streaming context? I assume
> it is rewritten otherwise the file where the stream is checkpointed to
> would grow forever. Is it every batch_duration?
>
> ie:
> For val streamingContext = new StreamingContext(conf, Seconds(10)) the
> checkpointed data would be rewritten every 10 seconds.
>
> Thanks
> -Adrian
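P.S. To make the intervals discussed in A2 concrete, here is a minimal Spark Streaming sketch. The checkpoint directory, socket source, and window/slide durations are all hypothetical; the point is where the two kinds of checkpointing are configured:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-interval").setMaster("local[2]")

// batch interval of 2 seconds
val ssc = new StreamingContext(conf, Seconds(2))

// enables metadata checkpointing (done every batch) and sets the
// directory used for RDD data checkpoints
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  // hypothetical path

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))  // window 30s, slide 10s

// data checkpointing interval: must be a multiple of the slide interval;
// without this call the default would be MAX(batch interval, 10 seconds)
counts.checkpoint(Seconds(20))

ssc.start()
ssc.awaitTermination()
```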