Do I need to maintain the consumer offset myself when using the Spark Streaming Kafka direct approach?
I am using the Spark Streaming Kafka direct approach these days. I found that when I start the application, it always starts consuming from the latest offset. I would like the application, on startup, to resume from the offset where the last run left off for the same Kafka consumer group. Does that mean I have to maintain the consumer offset myself, for example by recording it in ZooKeeper and reloading the last offset from ZooKeeper when restarting the application?

I see the following discussion:
https://github.com/apache/spark/pull/4805
https://issues.apache.org/jira/browse/SPARK-6249

Is there any conclusion? Do we need to maintain the offset ourselves, or will Spark Streaming support a feature to simplify the offset maintenance work?

https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html
RE: Do I need to maintain the consumer offset myself when using the Spark Streaming Kafka direct approach?
You need to maintain the offset yourself, and rightly so, in something like ZooKeeper.

From: Tao Li [mailto:litao.bupt...@gmail.com]
Sent: Tuesday, December 08, 2015 5:36 PM
To: user@spark.apache.org
Subject: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

> I am using the Spark Streaming Kafka direct approach these days. I found
> that when I start the application, it always starts consuming from the
> latest offset. [...]
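The pattern behind this advice is small: persist the last processed offset for each (topic, partition) under your consumer group, and read it back on startup. A minimal sketch of that read/write pattern, using a JSON file as a stand-in for ZooKeeper (in production you would use a ZooKeeper client such as kazoo or Curator with the same shape); the function and key names are illustrative, not part of any Spark or Kafka API:

```python
import json
import os
import tempfile

def save_offsets(path, group, offsets):
    """Persist {(topic, partition): offset} for one consumer group."""
    # JSON object keys must be strings, so encode each key as "topic/partition".
    data = {"%s/%d" % (t, p): o for (t, p), o in offsets.items()}
    with open(path, "w") as f:
        json.dump({group: data}, f)

def load_offsets(path, group):
    """Reload the last committed offsets; empty dict on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        data = json.load(f).get(group, {})
    out = {}
    for key, off in data.items():
        topic, part = key.rsplit("/", 1)
        out[(topic, int(part))] = off
    return out

path = os.path.join(tempfile.gettempdir(), "offsets.json")
save_offsets(path, "my-group", {("clicks", 0): 42, ("clicks", 1): 7})
print(load_offsets(path, "my-group"))  # {('clicks', 0): 42, ('clicks', 1): 7}
```

On restart you would feed the loaded map into the direct stream's starting offsets instead of letting it default to the latest position.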
Re: Do I need to maintain the consumer offset myself when using the Spark Streaming Kafka direct approach?
With the direct stream, the checkpoint is not recoverable if you modify your driver code. So if you rely on checkpointing alone to commit offsets, you can lose messages when you change the driver code and restart from the "largest" offset. If you do not want to lose messages, you need to commit offsets to an external store when using the direct stream.

On Tue, Dec 8, 2015 at 7:47 PM, PhuDuc Nguyen wrote:
> Kafka Direct approach: You can use checkpointing and that will maintain
> consumer offsets for you. You'll want to checkpoint to a highly available
> file system like HDFS or S3. [...]
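What makes external offset storage safe is the ordering: write the batch's output first, commit the offsets second. If the driver dies between the two steps, the uncommitted batch is simply reprocessed on restart (at-least-once) rather than skipped. A minimal sketch of that ordering with plain Python data structures standing in for the sink and the offset store; all names here are illustrative:

```python
# External offset store: last committed offset per (topic, partition).
committed = {("clicks", 0): 0}
# External sink for processed records.
results = []

def process_batch(records, topic, partition):
    """Process everything after the last committed offset, then commit."""
    start = committed[(topic, partition)]
    batch = records[start:]                       # offsets [start, len(records))
    results.extend(r.upper() for r in batch)      # 1. write output to the sink
    committed[(topic, partition)] = len(records)  # 2. only then commit offsets

log = ["a", "b", "c"]
process_batch(log, "clicks", 0)
log += ["d", "e"]
process_batch(log, "clicks", 0)
print(results, committed)  # ['A', 'B', 'C', 'D', 'E'] {('clicks', 0): 5}
```

Reversing the two steps (commit before write) would turn a crash into silent message loss, which is exactly the failure mode described above.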
Re: Do I need to maintain the consumer offset myself when using the Spark Streaming Kafka direct approach?
Kafka Receiver-based approach: This will maintain the consumer offsets in ZK for you.

Kafka Direct approach: You can use checkpointing, and that will maintain consumer offsets for you. You'll want to checkpoint to a highly available file system like HDFS or S3.

http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

You don't have to maintain your own offsets if you don't want to. If the two solutions above don't satisfy your requirements, then consider writing your own; otherwise I would recommend using the supported features in Spark.

HTH,
Duc

On Tue, Dec 8, 2015 at 5:05 AM, Tao Li wrote:
> I am using the Spark Streaming Kafka direct approach these days. I found
> that when I start the application, it always starts consuming from the
> latest offset. [...]
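If you do end up writing your own, the startup decision per partition is simple: resume from the externally stored offset if one exists, otherwise fall back to the reset policy ("largest" or "smallest" in Kafka 0.8 terms) that the questioner is seeing today. A tiny sketch of that decision; the function name and arguments are illustrative:

```python
def starting_offsets(stored, latest, reset="largest"):
    """Pick the offset to begin from for one partition on startup."""
    if stored is not None:
        return stored                          # resume where the last run stopped
    return latest if reset == "largest" else 0  # first run: honor the reset policy

# First run: nothing stored, start at the log end (latest offset 100).
print(starting_offsets(None, 100))  # 100
# Restart: resume from the stored offset even though the log has grown.
print(starting_offsets(42, 100))    # 42
```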