Re: Spark Streaming + Kafka failure recovery

Cody Koeninger Tue, 19 May 2015 10:59:16 -0700

Have you read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md ?


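1. There's nothing preventing that. The direct approach doesn't use a
consumer group, so each consumer tracks its own offsets and sees every
message.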

2. Checkpointing will give you at-least-once semantics, provided you have
sufficient Kafka retention to cover the time the job is down. Be aware
that checkpoints aren't recoverable if you upgrade your code.
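
To make the at-least-once point concrete, here is a toy simulation in plain
Python. It is not Spark or Kafka code; the names ToyLog and run_batch and the
checkpoint dict are made up for illustration. It only models the idea that a
job killed after writing output, but before its progress is recorded, will
replay those messages on restart (duplicates, no loss), and that recovery
fails once the log has aged out the offsets it needs.

```python
class ToyLog:
    """Hypothetical stand-in for one Kafka partition with bounded retention."""

    def __init__(self, retention):
        self.retention = retention   # max number of messages kept
        self.first_offset = 0        # offset of the oldest retained message
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        # Drop the oldest messages once retention is exceeded.
        while len(self.messages) > self.retention:
            self.messages.pop(0)
            self.first_offset += 1

    def end_offset(self):
        return self.first_offset + len(self.messages)

    def read(self, start, end):
        if start < self.first_offset:
            raise LookupError("offset %d is no longer retained" % start)
        lo = start - self.first_offset
        hi = end - self.first_offset
        return self.messages[lo:hi]


def run_batch(log, checkpoint, sink, crash_before_commit=False):
    """Process everything from the checkpointed offset to the end of the log."""
    start = checkpoint["offset"]
    end = log.end_offset()
    for msg in log.read(start, end):
        sink.append(msg)             # side effect happens first
    if crash_before_commit:
        return                       # simulated kill: progress not recorded
    checkpoint["offset"] = end       # record progress only after output


log = ToyLog(retention=100)
for i in range(3):
    log.append(i)

checkpoint = {"offset": 0}
sink = []
run_batch(log, checkpoint, sink, crash_before_commit=True)  # job killed
log.append(3)
run_batch(log, checkpoint, sink)                            # job resubmitted
print(sink)  # [0, 1, 2, 0, 1, 2, 3]: nothing lost, first batch duplicated
```

If the duplicates matter, the output step has to be idempotent or
transactional; the blog post linked above goes into that.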

On Tue, May 19, 2015 at 12:42 PM, Bill Jay <bill.jaypeter...@gmail.com>
wrote:

> Hi all,
>
> I am currently using Spark Streaming to consume and save logs every hour
> in our production pipeline. The current setup is a crontab job that
> checks every minute whether the job is still running and, if not,
> resubmits the Spark Streaming job. I am currently using the direct
> approach for the Kafka consumer. I have two questions:
>
> 1. In the direct approach, no offset is stored in ZooKeeper and no group
> id is specified. Can two consumers (one is Spark Streaming and the other is
> a Kafka console consumer in the Kafka package) read from the same topic from
> the brokers together (I would like both of them to get all messages, i.e.
> publish-subscribe mode)? What about two Spark Streaming jobs reading from
> the same topic?
>
> 2. How can I avoid data loss if a Spark job is killed? Does checkpointing
> serve this purpose? The default behavior of Spark Streaming is to read the
> latest logs. However, if a job is killed, can the new job resume from where
> the old one left off to avoid losing logs?
>
> Thanks!
>
> Bill
>
