Kafka Receiver-based approach:
This approach maintains the consumer offsets in ZooKeeper for you.

Kafka Direct approach:
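You can enable checkpointing, and Spark will save the consumed offsets in the
checkpoint for you. A restarted application resumes from those offsets as long
as it recreates its context from the checkpoint (StreamingContext.getOrCreate).
You'll want to checkpoint to a highly available file system like HDFS or S3.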
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

You don't have to maintain your own offsets if you don't want to. If neither
of the two approaches above satisfies your requirements, then consider writing
your own; otherwise I would recommend using the supported features in Spark.
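
If you do end up writing your own, the moving parts are roughly: load the last
committed offsets on startup, skip anything already handled, and commit the new
offsets only after a batch has been processed (which gives at-least-once
behaviour). Below is a minimal sketch of that idea in plain Python. It is not
Spark or Kafka API: FileOffsetStore, process_batch and the "topic:partition"
key format are hypothetical names for illustration, and a local JSON file
stands in for ZooKeeper or HDFS.

```python
import json
import os
import tempfile


class FileOffsetStore:
    """Hypothetical store: one JSON file mapping "topic:partition" -> next offset."""

    def __init__(self, path):
        self.path = path

    def load(self):
        # No file yet means a first run; the caller decides where to start.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def commit(self, offsets):
        # Write to a temp file and rename it, so a crash cannot leave
        # a half-written offsets file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(offsets, f)
        os.replace(tmp, self.path)


def process_batch(store, partition_key, records, handle):
    """Handle (offset, value) records not yet processed, then commit.

    Returns the number of records actually handled.
    """
    offsets = store.load()
    next_offset = offsets.get(partition_key, 0)
    handled = 0
    for offset, value in records:
        if offset < next_offset:
            continue  # already processed before the restart
        handle(value)
        next_offset = offset + 1
        handled += 1
    # Commit only after processing, so a failure replays instead of skipping.
    offsets[partition_key] = next_offset
    store.commit(offsets)
    return handled
```

In a real job the load step would feed the starting offsets of the direct
stream, and the commit step would run at the end of each batch.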

HTH,
Duc



On Tue, Dec 8, 2015 at 5:05 AM, Tao Li <[email protected]> wrote:

> I have been using the Spark Streaming Kafka direct approach recently. I found
> that when I start the application, it always starts consuming from the latest
> offset. I would like it to resume from the last offset consumed by the
> previous run with the same Kafka consumer group. Does this mean I have to
> maintain the consumer offsets myself, for example by recording them in
> ZooKeeper and reloading the last offset from ZooKeeper when restarting the
> application?
>
> I see the following discussion:
> https://github.com/apache/spark/pull/4805
> https://issues.apache.org/jira/browse/SPARK-6249
>
> Is there any conclusion? Do I need to maintain the offsets myself, or will
> Spark Streaming add a feature to simplify offset management?
>
>
> https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html
>
