I have a use case where I need to start a stream job by replaying historical data and then have it continue processing from a live Kafka source, and I'm looking for guidance / best practices for the implementation.

Basically, I want to start up a new “version” of the streaming job, have it process each element from a specified range of historical data, and then continue processing records from the live source. I'd let the job catch up to current time, atomically switch to it as the “live/current” job, and then shut down the previously running one. The part I'm struggling with is the switch-over from the historical source to the live source. Is there a way to build a composite stream source that emits records from a bounded data set before consuming from a Kafka topic, for example? Or would I be better off stopping the job once it has read through the historical set, switching its source to the live topic, and restarting it? Some of our jobs rely on rolling fold state, so I think I need to resume from a savepoint of the historical processing.
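
To make the composite-source idea concrete, here is a rough sketch of the kind of thing I have in mind. It assumes Flink's SourceFunction API and the plain Kafka consumer client rather than FlinkKafkaConsumer; the class name, constructor parameters, and String record type are just placeholders, and I haven't actually run this, so I may well be missing offset/checkpointing concerns that the real connector handles:

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

/**
 * Hypothetical composite source: emits a bounded set of historical
 * records first, then keeps polling a live Kafka topic. The fields must
 * be serializable since Flink ships the source to the task managers.
 */
public class HistoricalThenLiveSource implements SourceFunction<String> {

    private final List<String> historicalRecords;  // bounded replay data
    private final Properties kafkaProps;           // bootstrap.servers, group.id, ...
    private final String topic;
    private volatile boolean running = true;

    public HistoricalThenLiveSource(List<String> historicalRecords,
                                    Properties kafkaProps,
                                    String topic) {
        this.historicalRecords = historicalRecords;
        this.kafkaProps = kafkaProps;
        this.topic = topic;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Phase 1: replay the bounded historical data set.
        for (String record : historicalRecords) {
            if (!running) {
                return;
            }
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
            }
        }

        // Phase 2: continue consuming from the live Kafka topic.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaProps)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (running) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(record.value());
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

The obvious limitation is that this source doesn't checkpoint Kafka offsets or the replay position, so on a restore it would start the replay over, which is part of why I'm wondering whether the stop / savepoint / switch-source / restart approach is the better pattern here.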


