Kafka Receiver-based approach: This will maintain the consumer offsets in ZooKeeper for you.
Kafka Direct approach: You can use checkpointing, which will maintain the consumer offsets for you. You'll want to checkpoint to a highly available file system such as HDFS or S3.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

You don't have to maintain your own offsets if you don't want to. If neither of the two approaches above satisfies your requirements, then consider writing your own; otherwise I would recommend using the supported features in Spark.

HTH,
Duc

On Tue, Dec 8, 2015 at 5:05 AM, Tao Li wrote:

> I have been using the Spark Streaming Kafka direct approach recently. I found
> that when I start the application, it always starts consuming from the latest
> offset. I would like the application, on start, to resume from the last offset
> consumed by the previous run with the same Kafka consumer group. Does that mean
> I have to maintain the consumer offsets myself, for example by recording them
> in ZooKeeper and reloading the last offset from ZooKeeper when restarting the
> application?
>
> I see the following discussion:
> https://github.com/apache/spark/pull/4805
> https://issues.apache.org/jira/browse/SPARK-6249
>
> Is there any conclusion? Do we need to maintain the offsets ourselves, or will
> Spark Streaming support a feature to simplify offset maintenance?
>
> https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html
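
If you do end up maintaining offsets yourself, the bookkeeping is roughly "load the last saved offsets on start, process a batch, save the new offsets only after the batch succeeds". Here is a minimal sketch of that pattern in plain Python. It does not use Spark or Kafka: `FileOffsetStore` and `run_batch` are hypothetical names invented for illustration, a local JSON file stands in for ZooKeeper, and defaulting to offset 0 for an unseen partition is an assumption.

```python
import json
import os
import tempfile


class FileOffsetStore:
    """Hypothetical helper: persists the next offset to read per (topic, partition)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        """Return {(topic, partition): offset}, or {} if nothing was saved yet."""
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            raw = json.load(f)
        return {(e["topic"], e["partition"]): e["offset"] for e in raw}

    def save(self, offsets):
        """Replace the stored offsets with `offsets` via write-then-rename."""
        raw = [{"topic": t, "partition": p, "offset": o}
               for (t, p), o in sorted(offsets.items())]
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(raw, f)
        # Rename within the same directory so readers never see a partial file.
        os.replace(tmp, self.path)


def run_batch(store, batch_until):
    """Simulate one batch: resume from stored offsets, process, then commit.

    `batch_until` maps (topic, partition) to the exclusive end offset of the
    batch. Returns the (start, end) range processed for each partition.
    """
    start = store.load()
    # Assumption: a partition with no saved offset starts at 0.
    ranges = {tp: (start.get(tp, 0), end) for tp, end in batch_until.items()}
    # ... process the records covered by `ranges` here ...
    # Commit only after processing succeeded; keep offsets of untouched partitions.
    merged = dict(start)
    merged.update(batch_until)
    store.save(merged)
    return ranges
```

In a real job, the map returned by `load()` would be what you pass as `fromOffsets` to `KafkaUtils.createDirectStream`, and the end offsets would come from each batch RDD's `offsetRanges()`. Note that saving after processing gives at-least-once semantics: if the job dies between processing and `save`, that batch is replayed on restart.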
