You can't use checkpoints across code upgrades.  That may or may not change
in the future, but for now it's a limitation of Spark checkpoints
(regardless of whether you're using Kafka).

Some options:

- Start up the new job on a different cluster, then kill the old job once
it has caught up to where the new job started.  If you care about duplicate
work, you should be doing idempotent / transactional writes anyway, which
should take care of the overlap between the two jobs.  If you're writing
output in batches, you may need to be a little more careful about handling
batch boundaries.

- Store the offsets somewhere other than the checkpoint, and provide them
on startup using the fromOffsets argument to createDirectStream (see the
sketch below).
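
For the second option, here's a rough sketch against the 0.8 direct stream
API in spark-streaming-kafka (Spark 1.3+).  loadOffsets and saveOffsets are
hypothetical placeholders for whatever store you use (e.g. a database table
keyed by topic and partition), not part of the Spark API:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object OffsetsOutsideCheckpoint {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("OffsetsOutsideCheckpoint"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Read the last committed offsets from your own store instead of a checkpoint.
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()

    // The message handler controls what each Kafka record becomes; here, just the value.
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => mmd.message()

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, String](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      // The first RDD of a direct stream carries its Kafka offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write your results, then save the offsets, ideally in the same
      // transaction so output and offsets can't get out of sync.
      saveOffsets(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Hypothetical helpers - replace with your own storage.
  def loadOffsets(): Map[TopicAndPartition, Long] = ???
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???
}

After an upgrade you just read the committed offsets back into fromOffsets
and start the new jar; no checkpoint directory is involved.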





On Thu, Jul 30, 2015 at 4:07 AM, Nicola Ferraro <nibbi...@gmail.com> wrote:

> Hi,
> I've read about the recent updates to the Spark Streaming integration with
> Kafka (I refer to the new approach without receivers).
> In the new approach, metadata are persisted in checkpoint folders on HDFS
> so that the SparkStreaming context can be recreated in case of failures.
> This means that the streaming application will restart from where it
> exited, and the message-consuming process continues with new messages only.
> Also, if I manually stop the streaming process and recreate the context
> from checkpoint (using an approach similar to
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala),
> the behavior would be the same.
>
> Now, suppose I want to change something in the software and modify the
> processing pipeline.
> Can spark use the previous checkpoint to recreate the new application?
> Will I ever be able to upgrade the software without processing all the
> messages in Kafka again?
>
> Regards,
> Nicola
>
