Have you read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md ?
1. There's nothing preventing that.

2. Checkpointing will give you at-least-once semantics, provided you have sufficient Kafka retention. Be aware that checkpoints aren't recoverable if you upgrade code.

On Tue, May 19, 2015 at 12:42 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

> Hi all,
>
> I am currently using Spark streaming to consume and save logs every hour
> in our production pipeline. The current setting is to run a crontab job to
> check every minute whether the job is still there and, if not, resubmit a
> Spark streaming job. I am currently using the direct approach for the Kafka
> consumer. I have two questions:
>
> 1. In the direct approach, no offset is stored in zookeeper and no group
> id is specified. Can two consumers (one is Spark streaming and the other is
> a Kafka console consumer in the Kafka package) read from the same topic from
> the brokers together (I would like both of them to get all messages, i.e.
> publish-subscribe mode)? What about two Spark streaming jobs reading from the
> same topic?
>
> 2. How to avoid data loss if a Spark job is killed? Does checkpointing
> serve this purpose? The default behavior of Spark streaming is to read the
> latest logs. However, if a job is killed, can the new job resume from where
> it left off to avoid losing logs?
>
> Thanks!
>
> Bill
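
To illustrate point 2 of the reply (at-least-once with checkpointing), here is a toy sketch in plain Python. It does not use any Spark or Kafka API. The names `run_job`, `checkpoint`, and `sink` are made up for illustration, the list `log` stands in for a Kafka partition that is still within retention, and the dict stands in for the checkpoint directory. The assumption is that output is written before progress is recorded, so a job killed between the two steps replays a batch on restart instead of skipping it.

```python
# Toy model only: plain Python, no Spark or Kafka APIs involved.
# All names here are hypothetical and exist only for this illustration.

def run_job(log, checkpoint, sink, batch_size, crash_after_batches=None):
    """Process `log` in batches, starting at the checkpointed offset.

    Each batch is written to `sink` first and the offset is saved second,
    so a crash between the two steps causes a replay, never a gap.
    """
    batches_done = 0
    while checkpoint["offset"] < len(log):
        start = checkpoint["offset"]
        end = min(start + batch_size, len(log))
        sink.extend(log[start:end])      # step 1: produce output
        if crash_after_batches is not None and batches_done == crash_after_batches:
            raise RuntimeError("job killed before the offset was saved")
        checkpoint["offset"] = end       # step 2: record progress
        batches_done += 1


log = ["m0", "m1", "m2", "m3", "m4", "m5"]  # stands in for a retained partition
checkpoint = {"offset": 0}                  # stands in for the checkpoint directory
sink = []

try:
    # First run is killed during its second batch, after writing output.
    run_job(log, checkpoint, sink, batch_size=2, crash_after_batches=1)
except RuntimeError:
    pass  # the cron job would resubmit the job here

# The resubmitted job resumes from the last saved offset (2).
run_job(log, checkpoint, sink, batch_size=2)

print(sink)
```

In this sketch no message is lost, but `m2` and `m3` reach the sink twice. That duplication is what at-least-once means; avoiding it requires idempotent or transactional output, which is the subject of the blog post linked at the top of the reply.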