Re: Kafka as source for batch job

Tzu-Li (Gordon) Tai Thu, 08 Feb 2018 02:25:05 -0800

Hi Hayden,

Have you tried looking into `KeyedDeserializationSchema#isEndOfStream` [1]?
I think that could be what you are looking for. It signals the end of the 
stream when consuming from Kafka.


Cheers,
Gordon

On 8 February 2018 at 10:44:59 AM, Marchant, Hayden (hayden.march...@citi.com) 
wrote:

I know that traditionally Kafka is used as a source for a streaming job. In our 
particular case, we are looking at extracting records from a Kafka topic from a 
particular well-defined offset range (per partition) - i.e. from offset X to 
offset Y. In this case, we'd somehow want the application to know that it has 
finished when it gets to offset Y. This is basically changes Kafka stream to be 
bounded data as opposed to unbounded in the usual Stream paradigm. 

What would be the best approach to do this in Flink? I see a few options, 
though there might be more: 

1. Use a regular streaming job, and have some external service that monitors 
the current offsets of the consumer group of the topic and manually stops job 
when the consumer group of the topic has finished 
Pros - simple wrt Flink, Cons - hacky 

2. Create a batch job, and a new InputFormat based on Kafka that reads the 
specified subset of Kafka topic into the source. 
Pros - represent bounded data from Kafka topic as batch source, Cons - requires 
implementation of source. 

3. Dump the subset of Kafka into a file and then trigger a more 'traditional' 
Flink batch job that reads from a file. 
Pros - simple, cons - unnecessary I/O. 

I personally prefer 1 and 3 for simplicity. Has anyone done anything like this 
before? 

Thanks, 
Hayden Marchant

Re: Kafka as source for batch job

Reply via email to