Hello Adrian,

A1: There is a significant difference between cache and checkpoint. Cache
materializes the RDD and keeps it in memory and/or disk. But the lineage
of the RDD (that is, the sequence of operations that generated the RDD) will
be remembered, so that if there are node failures and parts of the cached
RDDs are lost, they can be regenerated. Checkpoint, however, saves the RDD
to an HDFS file AND actually FORGETS the lineage completely. This allows
long lineages to be truncated and the data to be saved reliably in HDFS
(which is naturally fault tolerant through replication).
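To make the contrast concrete, here is a minimal sketch of the two calls (the app name, master setting, and checkpoint directory path are illustrative, and running it assumes a local Spark installation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheVsCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("demo").setMaster("local[2]"))

    // The checkpoint directory must be set before calling checkpoint()
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)

    rdd.cache()       // materialize in memory/disk; lineage is kept for recovery
    rdd.checkpoint()  // on the next action, save to HDFS and truncate the lineage
    rdd.count()       // action that triggers both the caching and the checkpoint

    sc.stop()
  }
}
```

Note that checkpointing happens lazily, at the first action after `checkpoint()` is called; caching the RDD first avoids recomputing it when it is written out.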

A2: The answer to this is a little involved. There are actually two kinds
of checkpointing going on -
(i) data checkpointing (i.e. RDD checkpoints), and
(ii) metadata checkpointing, which stores the RDD checkpoint file names
associated with DStreams, among other things, to a file in HDFS

The metadata checkpoint is done every batch.
The data checkpointing, that is, the frequency at which the RDDs of a
DStream are checkpointed, is configurable with
DStream.checkpoint(...duration...). The duration must be a multiple of the
batch interval / slide interval for windowed streams. By default, the
checkpoint interval is MAX(batch interval, 10 seconds). So if the batch
interval is more than 10 seconds, RDDs are checkpointed every batch; if it
is less than 10 seconds, they are checkpointed every 10 seconds or so (the
smallest multiple of the batch interval that is at least 10 seconds).
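The default rule above can be sketched as a small helper (this is just an illustration of the MAX(batch interval, 10 seconds) rounding rule described here, not Spark's actual source):

```scala
// Sketch of the default data-checkpoint interval: the smallest multiple of
// the batch interval that is at least 10 seconds.
def defaultCheckpointIntervalSec(batchIntervalSec: Long): Long = {
  val floorSec = 10L
  if (batchIntervalSec >= floorSec) {
    batchIntervalSec                  // batch >= 10s: checkpoint every batch
  } else {
    // round the 10-second floor up to the next multiple of the batch interval
    val multiples = (floorSec + batchIntervalSec - 1) / batchIntervalSec
    multiples * batchIntervalSec
  }
}

println(defaultCheckpointIntervalSec(15)) // 15 - every batch
println(defaultCheckpointIntervalSec(5))  // 10 - every 2 batches
println(defaultCheckpointIntervalSec(4))  // 12 - every 3 batches
```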

TD


On Fri, Feb 14, 2014 at 8:47 AM, Adrian Mocanu <amoc...@verticalscope.com> wrote:

>  Hi
>
>
>
> Q1:
>
> Is .checkpoint() the same as .cache() but the difference is that it writes
> to disk instead of memory?
>
>
>
> Q2:
>
> When is the checkpointed data rewritten in a streaming context? I assume
> it is rewritten otherwise the file where the stream is checkpointed to
> would grow forever. Is it every batch_duration?
>
> ie:
>
> For  val streamingContext = new StreamingContext(conf, Seconds(10))  the
> checkpointed data would be rewritten every 10 seconds.
>
>
>
> Thanks
>
> -Adrian
>
>
>
