Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?
I have used what Gwen has suggested but to avoid false positive: While consuming records keep track of *last* consumed offset and compare with latest offset on broker for consumed topic when you get TimeOut Exception for that particular partition for given topic (e.g JMX Bean *LogEndOffset *for consumed topic for given partition. This works well. In our use case, we were using High Level Consumer for only *single* topic. I hope this helps ! Thanks, Bhavesh On Sun, May 10, 2015 at 2:03 PM, Ewen Cheslack-Postava e...@confluent.io wrote: @Gwen- But that only works for topics that have low enough traffic that you would ever actually hit that timeout. The Confluent schema registry needs to do something similar to make sure it has fully consumed the topic it stores data in so it doesn't serve stale data. We know in our case we'll only have a single producer to the topic (the current leader of the schema registry cluster) so we have a different solution. We produce a message to the topic (which is 1 partition, but this works for a topic partition too), grab the resulting offset from the response, then consume until we see the message we produced. Obviously this isn't ideal since we a) have to produce extra bogus messages to the topic and b) it only works in the case where you know the consumer is also the only producer. The new consumer interface sort of addresses this since it has seek functionality, where one of the options is seekToEnd. However, I think you have to be very careful with this, especially using the current implementation. It seeks to the end, but it also marks those messages as consumed. This means that even if you keep track of your original position and seek back to it, if you use background offset commits you could end up committing incorrect offsets, crashing, and then missing some messages when another consumer claims that partition (or just due to another consumer joining the group). Not sure if there are many other use cases for grabbing the offset data with a simple API. Might mean there's a use case for either some additional API or some utilities independent of an actual consumer instance which allow you to easily query the state of topics/partitions. On Sun, May 10, 2015 at 12:43 AM, Gwen Shapira gshap...@cloudera.com wrote: For Flume, we use the timeout configuration and catch the exception, with the assumption that no messages for few seconds == the end. On Sat, May 9, 2015 at 2:04 AM, James Cheng jch...@tivo.com wrote: Hi, I want to use the high level consumer to read all partitions for a topic, and know when I have reached the end. I know the end might be a little vague, since items keep showing up, but I'm trying to get as close as possible. I know that more messages might show up later, but I want to know when I've received all the items that are currently available in the topic. Is there a standard/recommended way to do this? I know one way to do it is to first issue an OffsetRequest for each partition, which would get me the last offset, and then use that information in my high level consumer to detect when I've reached that a message with that offset. Which is exactly what the SimpleConsumer example does ( https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example ). That involves finding the leader for the partition, etc etc. Not hard, but a bunch of steps. I noticed that kafkacat has an option similar to what I'm looking for: -e Exit successfully when last message received Looking at the code, it appears that a FetchRequest returns the HighwaterMarkOffset mark for a partition, and the API docs confirm that: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does the Java high-level consumer expose the HighwaterMarkOffset in any way? I looked but I couldn't find such a thing. Thanks, -James -- Thanks, Ewen
Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?
Thanks everyone. To answer Charlie's question: I'm doing some simple stream processing. I have Topics A,B, and C, all using log compaction and all recordings having primary keys. The data in Topic A is essentially a routing table that tells me which primary keys in Topics B and C I should pay attention to. So before I start consuming B and C, I need to have all/most of Topic A loaded into a local routing table. As Topic A is updated, then I will continue to update my routing table, and use it to continually process events coming from B and C. Hope that makes sense. All of the solutions look good. Will, that patch does exactly what I want, but I'm not sure I want to patch Kafka right now. I'll keep it in mind. Thanks. -James On May 9, 2015, at 10:42 AM, Charlie Knudsen charlie.knud...@smartthings.com wrote: Hi James, What are you trying to do exactly? If all you are trying to do is monitor how far behind a consumer is getting you could use the ConsumerOffsetChecker. As described in the link below. http://community.spiceworks.com/how_to/77610-how-far-behind-is-your-kafka-consumer Each message being processed will also have the offset and partition attached to it so with that data. I suppose that information plus info from a fetch response you could determine this with in an application. https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does that help? On Fri, May 8, 2015 at 6:04 PM, James Cheng jch...@tivo.com wrote: Hi, I want to use the high level consumer to read all partitions for a topic, and know when I have reached the end. I know the end might be a little vague, since items keep showing up, but I'm trying to get as close as possible. I know that more messages might show up later, but I want to know when I've received all the items that are currently available in the topic. Is there a standard/recommended way to do this? I know one way to do it is to first issue an OffsetRequest for each partition, which would get me the last offset, and then use that information in my high level consumer to detect when I've reached that a message with that offset. Which is exactly what the SimpleConsumer example does ( https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example). That involves finding the leader for the partition, etc etc. Not hard, but a bunch of steps. I noticed that kafkacat has an option similar to what I'm looking for: -e Exit successfully when last message received Looking at the code, it appears that a FetchRequest returns the HighwaterMarkOffset mark for a partition, and the API docs confirm that: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does the Java high-level consumer expose the HighwaterMarkOffset in any way? I looked but I couldn't find such a thing. Thanks, -James
Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?
@Gwen- But that only works for topics that have low enough traffic that you would ever actually hit that timeout. The Confluent schema registry needs to do something similar to make sure it has fully consumed the topic it stores data in so it doesn't serve stale data. We know in our case we'll only have a single producer to the topic (the current leader of the schema registry cluster) so we have a different solution. We produce a message to the topic (which is 1 partition, but this works for a topic partition too), grab the resulting offset from the response, then consume until we see the message we produced. Obviously this isn't ideal since we a) have to produce extra bogus messages to the topic and b) it only works in the case where you know the consumer is also the only producer. The new consumer interface sort of addresses this since it has seek functionality, where one of the options is seekToEnd. However, I think you have to be very careful with this, especially using the current implementation. It seeks to the end, but it also marks those messages as consumed. This means that even if you keep track of your original position and seek back to it, if you use background offset commits you could end up committing incorrect offsets, crashing, and then missing some messages when another consumer claims that partition (or just due to another consumer joining the group). Not sure if there are many other use cases for grabbing the offset data with a simple API. Might mean there's a use case for either some additional API or some utilities independent of an actual consumer instance which allow you to easily query the state of topics/partitions. On Sun, May 10, 2015 at 12:43 AM, Gwen Shapira gshap...@cloudera.com wrote: For Flume, we use the timeout configuration and catch the exception, with the assumption that no messages for few seconds == the end. On Sat, May 9, 2015 at 2:04 AM, James Cheng jch...@tivo.com wrote: Hi, I want to use the high level consumer to read all partitions for a topic, and know when I have reached the end. I know the end might be a little vague, since items keep showing up, but I'm trying to get as close as possible. I know that more messages might show up later, but I want to know when I've received all the items that are currently available in the topic. Is there a standard/recommended way to do this? I know one way to do it is to first issue an OffsetRequest for each partition, which would get me the last offset, and then use that information in my high level consumer to detect when I've reached that a message with that offset. Which is exactly what the SimpleConsumer example does ( https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example ). That involves finding the leader for the partition, etc etc. Not hard, but a bunch of steps. I noticed that kafkacat has an option similar to what I'm looking for: -e Exit successfully when last message received Looking at the code, it appears that a FetchRequest returns the HighwaterMarkOffset mark for a partition, and the API docs confirm that: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does the Java high-level consumer expose the HighwaterMarkOffset in any way? I looked but I couldn't find such a thing. Thanks, -James -- Thanks, Ewen
Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?
For Flume, we use the timeout configuration and catch the exception, with the assumption that no messages for few seconds == the end. On Sat, May 9, 2015 at 2:04 AM, James Cheng jch...@tivo.com wrote: Hi, I want to use the high level consumer to read all partitions for a topic, and know when I have reached the end. I know the end might be a little vague, since items keep showing up, but I'm trying to get as close as possible. I know that more messages might show up later, but I want to know when I've received all the items that are currently available in the topic. Is there a standard/recommended way to do this? I know one way to do it is to first issue an OffsetRequest for each partition, which would get me the last offset, and then use that information in my high level consumer to detect when I've reached that a message with that offset. Which is exactly what the SimpleConsumer example does ( https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example). That involves finding the leader for the partition, etc etc. Not hard, but a bunch of steps. I noticed that kafkacat has an option similar to what I'm looking for: -e Exit successfully when last message received Looking at the code, it appears that a FetchRequest returns the HighwaterMarkOffset mark for a partition, and the API docs confirm that: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does the Java high-level consumer expose the HighwaterMarkOffset in any way? I looked but I couldn't find such a thing. Thanks, -James
Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?
Hi James, What are you trying to do exactly? If all you are trying to do is monitor how far behind a consumer is getting you could use the ConsumerOffsetChecker. As described in the link below. http://community.spiceworks.com/how_to/77610-how-far-behind-is-your-kafka-consumer Each message being processed will also have the offset and partition attached to it so with that data. I suppose that information plus info from a fetch response you could determine this with in an application. https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does that help? On Fri, May 8, 2015 at 6:04 PM, James Cheng jch...@tivo.com wrote: Hi, I want to use the high level consumer to read all partitions for a topic, and know when I have reached the end. I know the end might be a little vague, since items keep showing up, but I'm trying to get as close as possible. I know that more messages might show up later, but I want to know when I've received all the items that are currently available in the topic. Is there a standard/recommended way to do this? I know one way to do it is to first issue an OffsetRequest for each partition, which would get me the last offset, and then use that information in my high level consumer to detect when I've reached that a message with that offset. Which is exactly what the SimpleConsumer example does ( https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example). That involves finding the leader for the partition, etc etc. Not hard, but a bunch of steps. I noticed that kafkacat has an option similar to what I'm looking for: -e Exit successfully when last message received Looking at the code, it appears that a FetchRequest returns the HighwaterMarkOffset mark for a partition, and the API docs confirm that: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does the Java high-level consumer expose the HighwaterMarkOffset in any way? I looked but I couldn't find such a thing. Thanks, -James
Re: Is there a way to know when I've reached the end of a partition (consumed all messages) when using the high-level consumer?
I've created a patch to expose the high end watermark, having this exact requirement. Still waiting for it to be accepted, but are using this in production at the moment and it works quite nicely: https://issues.apache.org/jira/browse/KAFKA-1977 On Sat, 9 May 2015 at 18:43 Charlie Knudsen charlie.knud...@smartthings.com wrote: Hi James, What are you trying to do exactly? If all you are trying to do is monitor how far behind a consumer is getting you could use the ConsumerOffsetChecker. As described in the link below. http://community.spiceworks.com/how_to/77610-how-far-behind-is-your-kafka-consumer Each message being processed will also have the offset and partition attached to it so with that data. I suppose that information plus info from a fetch response you could determine this with in an application. https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does that help? On Fri, May 8, 2015 at 6:04 PM, James Cheng jch...@tivo.com wrote: Hi, I want to use the high level consumer to read all partitions for a topic, and know when I have reached the end. I know the end might be a little vague, since items keep showing up, but I'm trying to get as close as possible. I know that more messages might show up later, but I want to know when I've received all the items that are currently available in the topic. Is there a standard/recommended way to do this? I know one way to do it is to first issue an OffsetRequest for each partition, which would get me the last offset, and then use that information in my high level consumer to detect when I've reached that a message with that offset. Which is exactly what the SimpleConsumer example does ( https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example ). That involves finding the leader for the partition, etc etc. Not hard, but a bunch of steps. I noticed that kafkacat has an option similar to what I'm looking for: -e Exit successfully when last message received Looking at the code, it appears that a FetchRequest returns the HighwaterMarkOffset mark for a partition, and the API docs confirm that: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse Does the Java high-level consumer expose the HighwaterMarkOffset in any way? I looked but I couldn't find such a thing. Thanks, -James