Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Tao Li
I am using the Spark Streaming Kafka direct approach these days. I found that
when I start the application, it always starts consuming from the latest
offset. I would like the application, on startup, to resume from the offset
where the last run with the same Kafka consumer group left off. Does that mean
I have to maintain the consumer offsets myself, for example by recording them
in ZooKeeper and reloading the last offsets from ZooKeeper when restarting the
application?

I see the following discussion:
https://github.com/apache/spark/pull/4805
https://issues.apache.org/jira/browse/SPARK-6249

Is there any conclusion? Do we need to maintain the offsets ourselves, or will
Spark Streaming add a feature to simplify offset maintenance?

https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html


RE: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Singh, Abhijeet
You need to maintain the offset yourself and rightly so in something like 
ZooKeeper.
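To make the pattern concrete, here is a minimal sketch of an external offset store. A local JSON file stands in for ZooKeeper (in a real deployment you would use a ZooKeeper client such as kazoo, or Kafka's own offset storage); the `OffsetStore` class, its method names, and the file path are all illustrative, not part of any Spark or Kafka API.

```python
import json
import os

class OffsetStore:
    """Persist (topic, partition) -> offset mappings between application runs.

    A JSON file stands in for ZooKeeper here; the backend is interchangeable
    as long as writes are durable before the next batch starts.
    """

    def __init__(self, path):
        self.path = path

    def save(self, offsets):
        # offsets: dict mapping "topic:partition" -> next offset to read
        with open(self.path, "w") as f:
            json.dump(offsets, f)

    def load(self):
        # On the very first run nothing is saved yet; returning an empty
        # dict lets the caller fall back to "largest" (or "smallest").
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

store = OffsetStore("/tmp/offsets.json")
store.save({"mytopic:0": 1042, "mytopic:1": 998})
print(store.load())  # {'mytopic:0': 1042, 'mytopic:1': 998}
```

On restart, the loaded map is what you would feed back into the direct stream as its starting offsets.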

From: Tao Li [mailto:litao.bupt...@gmail.com]
Sent: Tuesday, December 08, 2015 5:36 PM
To: user@spark.apache.org
Subject: Need to maintain the consumer offset by myself when using spark 
streaming kafka direct approach?



Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Dibyendu Bhattacharya
In the direct stream approach, the checkpoint data is not recoverable if you
modify your driver code. So if you rely only on checkpointing to commit
offsets, you can lose messages when you modify the driver code and restart
from the "largest" offset. If you do not want to lose messages, you need to
commit offsets to an external store when using the direct stream.
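The at-least-once ordering this describes can be sketched without any Spark machinery. The names here (`run_batch`, the toy `log`, the in-memory `committed` store) are illustrative; the point is only the ordering: write the results first, commit the offset second, so a crash between the two steps replays the batch instead of dropping it.

```python
# A toy log of (offset, message) pairs standing in for a Kafka partition.
log = [(i, "msg-%d" % i) for i in range(10)]

committed = {"offset": 0}   # external store: next offset to process
results = []                # side effects of processing (e.g. rows written)

def run_batch(batch_size):
    """Process one batch, committing the offset only after the work is done."""
    start = committed["offset"]
    batch = log[start:start + batch_size]
    for offset, msg in batch:
        results.append(msg)                      # 1. do the work first
    if batch:
        committed["offset"] = batch[-1][0] + 1   # 2. then commit the offset
    return len(batch)

run_batch(4)
run_batch(4)
print(committed["offset"], len(results))  # 8 8
```

If the process dies after step 1 but before step 2, the next run re-reads the same batch from the uncommitted offset, which is the at-least-once guarantee (duplicates are possible, loss is not).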



Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread PhuDuc Nguyen
Kafka Receiver-based approach:
This will maintain the consumer offsets in ZK for you.

Kafka Direct approach:
You can use checkpointing and that will maintain consumer offsets for you.
You'll want to checkpoint to a highly available file system like HDFS or S3.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

You don't have to maintain your own offsets if you don't want to. If the two
solutions above don't satisfy your requirements, then consider writing your
own; otherwise I would recommend using the supported features in Spark.

HTH,
Duc
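With the direct approach, a restart then boils down to asking the store for saved offsets and falling back to a default position for any partition that has none. A minimal sketch of that decision follows; the `saved`/`partitions`/`default` names are illustrative (in the Scala direct API the resulting map would be passed as the `fromOffsets` argument of `KafkaUtils.createDirectStream`).

```python
def starting_offsets(saved, partitions, default="largest"):
    """Choose where each partition should begin consuming.

    saved      -- dict of "topic:partition" -> offset from the external store
    partitions -- the partitions the job is about to consume
    default    -- position for partitions with no saved offset
                  ("largest" mirrors Kafka's auto.offset.reset=largest)
    """
    return {p: saved.get(p, default) for p in partitions}

saved = {"mytopic:0": 1042}          # mytopic:1 has never been committed
parts = ["mytopic:0", "mytopic:1"]
print(starting_offsets(saved, parts))
# {'mytopic:0': 1042, 'mytopic:1': 'largest'}
```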


