Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?

2015-05-11 Thread Bhavesh Mistry
I have used what Gwen has suggested but to avoid false positive:

While consuming records keep track of *last* consumed offset and compare
with latest offset on broker for consumed topic when you get TimeOut
Exception for that particular partition for given topic (e.g JMX Bean
*LogEndOffset
*for consumed topic for given partition.

This works well.  In our use case,  we were using High Level Consumer for
only *single* topic.

I hope this helps !


Thanks,

Bhavesh

On Sun, May 10, 2015 at 2:03 PM, Ewen Cheslack-Postava e...@confluent.io
wrote:

 @Gwen- But that only works for topics that have low enough traffic that you
 would ever actually hit that timeout.

 The Confluent schema registry needs to do something similar to make sure it
 has fully consumed the topic it stores data in so it doesn't serve stale
 data. We know in our case we'll only have a single producer to the topic
 (the current leader of the schema registry cluster) so we have a different
 solution. We produce a message to the topic (which is 1 partition, but this
 works for a topic partition too), grab the resulting offset from the
 response, then consume until we see the message we produced. Obviously this
 isn't ideal since we a) have to produce extra bogus messages to the topic
 and b) it only works in the case where you know the consumer is also the
 only producer.

 The new consumer interface sort of addresses this since it has seek
 functionality, where one of the options is seekToEnd. However, I think you
 have to be very careful with this, especially using the current
 implementation. It seeks to the end, but it also marks those messages as
 consumed. This means that even if you keep track of your original position
 and seek back to it, if you use background offset commits you could end up
 committing incorrect offsets, crashing, and then missing some messages when
 another consumer claims that partition (or just due to another consumer
 joining the group).

 Not sure if there are many other use cases for grabbing the offset data
 with a simple API. Might mean there's a use case for either some additional
 API or some utilities independent of an actual consumer instance which
 allow you to easily query the state of topics/partitions.


 On Sun, May 10, 2015 at 12:43 AM, Gwen Shapira gshap...@cloudera.com
 wrote:

  For Flume, we use the timeout configuration and catch the exception, with
  the assumption that no messages for few seconds == the end.
 
  On Sat, May 9, 2015 at 2:04 AM, James Cheng jch...@tivo.com wrote:
 
   Hi,
  
   I want to use the high level consumer to read all partitions for a
 topic,
   and know when I have reached the end. I know the end might be a
  little
   vague, since items keep showing up, but I'm trying to get as close as
   possible. I know that more messages might show up later, but I want to
  know
   when I've received all the items that are currently available in the
  topic.
  
   Is there a standard/recommended way to do this?
  
   I know one way to do it is to first issue an OffsetRequest for each
   partition, which would get me the last offset, and then use that
   information in my high level consumer to detect when I've reached that
 a
   message with that offset. Which is exactly what the SimpleConsumer
  example
   does (
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
  ).
   That involves finding the leader for the partition, etc etc. Not hard,
  but
   a bunch of steps.
  
   I noticed that kafkacat has an option similar to what I'm looking for:
 -e Exit successfully when last message received
  
   Looking at the code, it appears that a FetchRequest returns the
   HighwaterMarkOffset mark for a partition, and the API docs confirm
 that:
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse
  
   Does the Java high-level consumer expose the HighwaterMarkOffset in any
   way? I looked but I couldn't find such a thing.
  
   Thanks,
   -James
  
  
 



 --
 Thanks,
 Ewen



Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?

2015-05-11 Thread James Cheng
Thanks everyone.

To answer Charlie's question:

I'm doing some simple stream processing. I have Topics A,B, and C, all using 
log compaction and all recordings having primary keys. The data in Topic A is 
essentially a routing table that tells me which primary keys in Topics B and C 
I should pay attention to. So before I start consuming B and C, I need to have 
all/most of Topic A loaded into a local routing table.  As Topic A is updated, 
then I will continue to update my routing table, and use it to continually 
process events coming from B and C.

Hope that makes sense.

All of the solutions look good. Will, that patch does exactly what I want, but 
I'm not sure I want to patch Kafka right now. I'll keep it in mind. Thanks.

-James

On May 9, 2015, at 10:42 AM, Charlie Knudsen charlie.knud...@smartthings.com 
wrote:

 Hi James,
 What are you trying to do exactly? If all you are trying to do is monitor
 how far behind a consumer is getting you could use the ConsumerOffsetChecker.
 As described in the link below.
 http://community.spiceworks.com/how_to/77610-how-far-behind-is-your-kafka-consumer
 
 Each message being processed will also have the offset and partition
 attached to it so with that data. I suppose that information plus info from
 a fetch response you could determine this with in an application.
 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse
 
 Does that help?
 
 
 On Fri, May 8, 2015 at 6:04 PM, James Cheng jch...@tivo.com wrote:
 
 Hi,
 
 I want to use the high level consumer to read all partitions for a topic,
 and know when I have reached the end. I know the end might be a little
 vague, since items keep showing up, but I'm trying to get as close as
 possible. I know that more messages might show up later, but I want to know
 when I've received all the items that are currently available in the topic.
 
 Is there a standard/recommended way to do this?
 
 I know one way to do it is to first issue an OffsetRequest for each
 partition, which would get me the last offset, and then use that
 information in my high level consumer to detect when I've reached that a
 message with that offset. Which is exactly what the SimpleConsumer example
 does (
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example).
 That involves finding the leader for the partition, etc etc. Not hard, but
 a bunch of steps.
 
 I noticed that kafkacat has an option similar to what I'm looking for:
  -e Exit successfully when last message received
 
 Looking at the code, it appears that a FetchRequest returns the
 HighwaterMarkOffset mark for a partition, and the API docs confirm that:
 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse
 
 Does the Java high-level consumer expose the HighwaterMarkOffset in any
 way? I looked but I couldn't find such a thing.
 
 Thanks,
 -James
 
 



Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?

2015-05-10 Thread Ewen Cheslack-Postava
@Gwen- But that only works for topics that have low enough traffic that you
would ever actually hit that timeout.

The Confluent schema registry needs to do something similar to make sure it
has fully consumed the topic it stores data in so it doesn't serve stale
data. We know in our case we'll only have a single producer to the topic
(the current leader of the schema registry cluster) so we have a different
solution. We produce a message to the topic (which is 1 partition, but this
works for a topic partition too), grab the resulting offset from the
response, then consume until we see the message we produced. Obviously this
isn't ideal since we a) have to produce extra bogus messages to the topic
and b) it only works in the case where you know the consumer is also the
only producer.

The new consumer interface sort of addresses this since it has seek
functionality, where one of the options is seekToEnd. However, I think you
have to be very careful with this, especially using the current
implementation. It seeks to the end, but it also marks those messages as
consumed. This means that even if you keep track of your original position
and seek back to it, if you use background offset commits you could end up
committing incorrect offsets, crashing, and then missing some messages when
another consumer claims that partition (or just due to another consumer
joining the group).

Not sure if there are many other use cases for grabbing the offset data
with a simple API. Might mean there's a use case for either some additional
API or some utilities independent of an actual consumer instance which
allow you to easily query the state of topics/partitions.


On Sun, May 10, 2015 at 12:43 AM, Gwen Shapira gshap...@cloudera.com
wrote:

 For Flume, we use the timeout configuration and catch the exception, with
 the assumption that no messages for few seconds == the end.

 On Sat, May 9, 2015 at 2:04 AM, James Cheng jch...@tivo.com wrote:

  Hi,
 
  I want to use the high level consumer to read all partitions for a topic,
  and know when I have reached the end. I know the end might be a
 little
  vague, since items keep showing up, but I'm trying to get as close as
  possible. I know that more messages might show up later, but I want to
 know
  when I've received all the items that are currently available in the
 topic.
 
  Is there a standard/recommended way to do this?
 
  I know one way to do it is to first issue an OffsetRequest for each
  partition, which would get me the last offset, and then use that
  information in my high level consumer to detect when I've reached that a
  message with that offset. Which is exactly what the SimpleConsumer
 example
  does (
 
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
 ).
  That involves finding the leader for the partition, etc etc. Not hard,
 but
  a bunch of steps.
 
  I noticed that kafkacat has an option similar to what I'm looking for:
-e Exit successfully when last message received
 
  Looking at the code, it appears that a FetchRequest returns the
  HighwaterMarkOffset mark for a partition, and the API docs confirm that:
 
 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse
 
  Does the Java high-level consumer expose the HighwaterMarkOffset in any
  way? I looked but I couldn't find such a thing.
 
  Thanks,
  -James
 
 




-- 
Thanks,
Ewen


Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?

2015-05-10 Thread Gwen Shapira
For Flume, we use the timeout configuration and catch the exception, with
the assumption that no messages for few seconds == the end.

On Sat, May 9, 2015 at 2:04 AM, James Cheng jch...@tivo.com wrote:

 Hi,

 I want to use the high level consumer to read all partitions for a topic,
 and know when I have reached the end. I know the end might be a little
 vague, since items keep showing up, but I'm trying to get as close as
 possible. I know that more messages might show up later, but I want to know
 when I've received all the items that are currently available in the topic.

 Is there a standard/recommended way to do this?

 I know one way to do it is to first issue an OffsetRequest for each
 partition, which would get me the last offset, and then use that
 information in my high level consumer to detect when I've reached that a
 message with that offset. Which is exactly what the SimpleConsumer example
 does (
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example).
 That involves finding the leader for the partition, etc etc. Not hard, but
 a bunch of steps.

 I noticed that kafkacat has an option similar to what I'm looking for:
   -e Exit successfully when last message received

 Looking at the code, it appears that a FetchRequest returns the
 HighwaterMarkOffset mark for a partition, and the API docs confirm that:
 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse

 Does the Java high-level consumer expose the HighwaterMarkOffset in any
 way? I looked but I couldn't find such a thing.

 Thanks,
 -James




Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?

2015-05-09 Thread Charlie Knudsen
Hi James,
What are you trying to do exactly? If all you are trying to do is monitor
how far behind a consumer is getting you could use the ConsumerOffsetChecker.
As described in the link below.
http://community.spiceworks.com/how_to/77610-how-far-behind-is-your-kafka-consumer

Each message being processed will also have the offset and partition
attached to it so with that data. I suppose that information plus info from
a fetch response you could determine this with in an application.
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse

Does that help?


On Fri, May 8, 2015 at 6:04 PM, James Cheng jch...@tivo.com wrote:

 Hi,

 I want to use the high level consumer to read all partitions for a topic,
 and know when I have reached the end. I know the end might be a little
 vague, since items keep showing up, but I'm trying to get as close as
 possible. I know that more messages might show up later, but I want to know
 when I've received all the items that are currently available in the topic.

 Is there a standard/recommended way to do this?

 I know one way to do it is to first issue an OffsetRequest for each
 partition, which would get me the last offset, and then use that
 information in my high level consumer to detect when I've reached that a
 message with that offset. Which is exactly what the SimpleConsumer example
 does (
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example).
 That involves finding the leader for the partition, etc etc. Not hard, but
 a bunch of steps.

 I noticed that kafkacat has an option similar to what I'm looking for:
   -e Exit successfully when last message received

 Looking at the code, it appears that a FetchRequest returns the
 HighwaterMarkOffset mark for a partition, and the API docs confirm that:
 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse

 Does the Java high-level consumer expose the HighwaterMarkOffset in any
 way? I looked but I couldn't find such a thing.

 Thanks,
 -James




Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?

2015-05-09 Thread Will Funnell
I've created a patch to expose the high end watermark, having this exact
requirement.

Still waiting for it to be accepted, but are using this in production at
the moment and it works quite nicely:
https://issues.apache.org/jira/browse/KAFKA-1977



On Sat, 9 May 2015 at 18:43 Charlie Knudsen charlie.knud...@smartthings.com
wrote:

 Hi James,
 What are you trying to do exactly? If all you are trying to do is monitor
 how far behind a consumer is getting you could use the
 ConsumerOffsetChecker.
 As described in the link below.

 http://community.spiceworks.com/how_to/77610-how-far-behind-is-your-kafka-consumer

 Each message being processed will also have the offset and partition
 attached to it so with that data. I suppose that information plus info from
 a fetch response you could determine this with in an application.

 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse

 Does that help?


 On Fri, May 8, 2015 at 6:04 PM, James Cheng jch...@tivo.com wrote:

  Hi,
 
  I want to use the high level consumer to read all partitions for a topic,
  and know when I have reached the end. I know the end might be a
 little
  vague, since items keep showing up, but I'm trying to get as close as
  possible. I know that more messages might show up later, but I want to
 know
  when I've received all the items that are currently available in the
 topic.
 
  Is there a standard/recommended way to do this?
 
  I know one way to do it is to first issue an OffsetRequest for each
  partition, which would get me the last offset, and then use that
  information in my high level consumer to detect when I've reached that a
  message with that offset. Which is exactly what the SimpleConsumer
 example
  does (
 
 https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
 ).
  That involves finding the leader for the partition, etc etc. Not hard,
 but
  a bunch of steps.
 
  I noticed that kafkacat has an option similar to what I'm looking for:
-e Exit successfully when last message received
 
  Looking at the code, it appears that a FetchRequest returns the
  HighwaterMarkOffset mark for a partition, and the API docs confirm that:
 
 https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse
 
  Does the Java high-level consumer expose the HighwaterMarkOffset in any
  way? I looked but I couldn't find such a thing.
 
  Thanks,
  -James