You can't use checkpoints across code upgrades. That may or may not change in the future, but for now it's a limitation of Spark checkpoints (regardless of whether you're using Kafka).
Some options:

- Start up the new job on a different cluster, then kill the old job once
it's caught up to where the new job started. If you care about duplicate
work, you should be doing idempotent / transactional writes anyway, which
should take care of the overlap between the two jobs (see the first sketch
below the quote). If you're doing batches, you may need to be a little more
careful about handling batch boundaries.
- Store the offsets somewhere other than the checkpoint, and provide them
on startup using the fromOffsets argument to createDirectStream (see the
second sketch below the quote).

On Thu, Jul 30, 2015 at 4:07 AM, Nicola Ferraro <nibbi...@gmail.com> wrote:

> Hi,
> I've read about the recent updates to Spark Streaming's integration with
> Kafka (I'm referring to the new approach without receivers).
> In the new approach, metadata is persisted in checkpoint folders on HDFS
> so that the Spark Streaming context can be recreated in case of failure.
> This means that the streaming application will restart from where it
> exited, and message consumption continues with new messages only.
> The behavior would be the same if I manually stopped the streaming
> process and recreated the context from the checkpoint (using an approach
> similar to
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala).
>
> Now, suppose I want to change something in the software and modify the
> processing pipeline.
> Can Spark use the previous checkpoint to recreate the new application?
> Will I ever be able to upgrade the software without processing all the
> messages in Kafka again?
>
> Regards,
> Nicola
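To make the "idempotent writes" point from the first option concrete, here
is a minimal sketch against a generic DStream: derive each output row
deterministically from its input record and write it under a key that is
unique per record, so the window where both jobs process the same offsets
just rewrites the same rows. transform and upsert are placeholders for your
own logic and store, not real APIs:

    import org.apache.spark.streaming.dstream.DStream

    // Placeholders for your own logic and keyed store -- swap in whatever
    // you actually use (Cassandra INSERT, HBase put, SQL MERGE, ...).
    def transform(value: String): String = value.toUpperCase
    def upsert(key: String, value: String): Unit = ()

    def writeIdempotently(stream: DStream[(String, String)]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          records.foreach { case (key, value) =>
            // The output is a pure function of the input and is keyed
            // uniquely per record, so reprocessing the same offsets
            // rewrites identical rows instead of producing duplicates.
            upsert(key, transform(value))
          }
        }
      }
    }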
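And a sketch of the second option against the direct stream API
(spark-streaming-kafka for Kafka 0.8): read the offsets back from wherever
you stored them, pass them as fromOffsets, and keep persisting the offset
ranges of each batch. The broker address, topic name, starting offsets, and
saveOffsets are placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    // Placeholder: persist the ranges to your own store (ZK, a DB, an HDFS
    // file, ...), ideally in the same transaction as your results.
    def saveOffsets(ranges: Array[OffsetRange]): Unit = ()

    val ssc = new StreamingContext(
      new SparkConf().setAppName("upgrade-safe-app"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Loaded from your own store at startup; the values here stand in for
    // whatever you last saved.
    val fromOffsets: Map[TopicAndPartition, Long] = Map(
      TopicAndPartition("mytopic", 0) -> 1000L,
      TopicAndPartition("mytopic", 1) -> 1000L)

    // The fromOffsets variant also takes a messageHandler saying what each
    // record becomes; here, a (key, value) pair.
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      // The direct stream's RDDs know exactly which offsets they cover.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd and do your (idempotent) writes here ...
      saveOffsets(ranges) // track progress outside the checkpoint
    }

    ssc.start()
    ssc.awaitTermination()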