Hi

I’m looking for some advice on the “right” way to load historical data into a
stream.

The case is as follows:
I have a stream, and sometimes I need to match the current live stream data
against data stored in a database, let’s say Elasticsearch. I generate a side
output with the query information and now want to get the matching rows from
Elasticsearch. The number of rows can be high, so I want to read them in a
paginated way and forward each response downstream as it is received. This
also means that I have to execute n queries against Elasticsearch, in order,
and I don’t know how many up front (the search response tells me whether
there is more data).
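
To make the pagination concrete, here is a minimal sketch of the read itself,
assuming the Elasticsearch high-level REST client and search_after paging;
QueryInfo is a hypothetical stand-in for my side-output record:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

import java.io.IOException;
import java.util.function.Consumer;

public class PaginatedFetch {

    // Hypothetical shape of the side-output record.
    public interface QueryInfo {
        String index();
        QueryBuilder toQueryBuilder();
    }

    private static final int PAGE_SIZE = 1000;

    // Executes n queries in order; the response itself tells us when to stop.
    public static void fetchAllPages(RestHighLevelClient client,
                                     QueryInfo query,
                                     Consumer<SearchHit> emit) throws IOException {
        Object[] searchAfter = null;
        while (true) {
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(query.toQueryBuilder())
                    .size(PAGE_SIZE)
                    .sort("_id", SortOrder.ASC);   // placeholder tie-breaker sort for search_after
            if (searchAfter != null) {
                source.searchAfter(searchAfter);
            }
            SearchResponse response = client.search(
                    new SearchRequest(query.index()).source(source), RequestOptions.DEFAULT);
            SearchHit[] hits = response.getHits().getHits();
            if (hits.length == 0) {
                return;                            // no more data
            }
            for (SearchHit hit : hits) {
                emit.accept(hit);                  // forward each page as it is received
            }
            searchAfter = hits[hits.length - 1].getSortValues();
        }
    }
}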

1. Use Async I/O
This works nicely, but if I read the data in a paginated way I have to buffer
all the data before I can return the result, and that doesn’t scale.
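
To illustrate the buffering problem: Flink’s ResultFuture can only be
completed once per input record, so a sketch like this (hypothetical client
factory, reusing the paginated fetch above) has to hold the entire result set
in memory:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.elasticsearch.client.RestHighLevelClient;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class BufferingAsyncLookup
        extends RichAsyncFunction<PaginatedFetch.QueryInfo, String> {

    private transient RestHighLevelClient client;

    @Override
    public void open(Configuration parameters) {
        client = EsClients.create();   // hypothetical client factory
    }

    @Override
    public void asyncInvoke(PaginatedFetch.QueryInfo query, ResultFuture<String> resultFuture) {
        CompletableFuture.runAsync(() -> {
            List<String> buffered = new ArrayList<>();
            try {
                // Every page lands in this list, because complete() may only
                // be called once per input record.
                PaginatedFetch.fetchAllPages(client, query,
                        hit -> buffered.add(hit.getSourceAsString()));
                resultFuture.complete(buffered);   // one shot: the whole result set sits in memory
            } catch (Exception e) {
                resultFuture.completeExceptionally(e);
            }
        });
    }
}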

2. Iterate the stream
The requirement is more recursive than iterative, and Flink’s iterations have
some limitations regarding checkpoints.
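
For reference, the feedback shape I would need looks roughly like this;
FetchOnePage is a hypothetical ProcessFunction, and as far as I understand,
records in flight on the feedback edge are not covered by checkpoints:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.util.OutputTag;

// queries is the DataStream<QueryInfo> from the side output. FetchOnePage is
// a hypothetical ProcessFunction that runs a single query, emits the rows,
// and side-outputs the follow-up QueryInfo while the response reports more data.
final OutputTag<PaginatedFetch.QueryInfo> NEXT_PAGE =
        new OutputTag<PaginatedFetch.QueryInfo>("next-page") {};

IterativeStream<PaginatedFetch.QueryInfo> loop = queries.iterate();
SingleOutputStreamOperator<String> rows = loop.process(new FetchOnePage(NEXT_PAGE));
loop.closeWith(rows.getSideOutput(NEXT_PAGE));   // feedback edge: not checkpointed
DataStream<String> results = rows;               // forward the rows downstream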

3. Process function
A process function is not intended to do external I/O operations, as they
take time to execute and block the task thread.
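
That is, the straightforward version below blocks the task thread for the
whole paginated read, which also holds up the checkpoint barriers queued
behind the record (same hypothetical types as above):

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.elasticsearch.client.RestHighLevelClient;

public class BlockingLookup
        extends ProcessFunction<PaginatedFetch.QueryInfo, String> {

    private transient RestHighLevelClient client;   // initialized in open(), omitted here

    @Override
    public void processElement(PaginatedFetch.QueryInfo query, Context ctx,
                               Collector<String> out) throws Exception {
        // Runs all n queries synchronously; this subtask makes no other
        // progress until the last page has been fetched.
        PaginatedFetch.fetchAllPages(client, query,
                hit -> out.collect(hit.getSourceAsString()));
    }
}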

4. Elasticsearch source together with Kafka
Store the side output in Kafka and create an Elasticsearch/Kafka source
function. Complicated.
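
Roughly the wiring I have in mind, using the standard Kafka connector (topic
name, serialization, and client setup are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

// Stage 1: persist the query side output to Kafka.
Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "kafka:9092");

queries.map(q -> q.toJson())   // hypothetical serialization of QueryInfo
       .addSink(new FlinkKafkaProducer<>("es-queries", new SimpleStringSchema(), kafkaProps));

// Stage 2: read the queries back (in the same or another job) and run the
// paginated Elasticsearch fetch, e.g. in a custom source or process function.
// env is the StreamExecutionEnvironment.
DataStream<String> replayedQueries =
        env.addSource(new FlinkKafkaConsumer<>("es-queries", new SimpleStringSchema(), kafkaProps));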

There could be other ways of doing it, and I’m open to good ideas and
suggestions on how to handle this challenge.

Thanks in advance 

Best regards
Lasse Nedergaard
