I have a use case where I need to start a streaming job by replaying historical data and then have it continue processing from a live Kafka source, and I'm looking for guidance / best practices for the implementation.

Basically, I want to start up a new "version" of the streaming job, have it process each element from a specified range of historical data, and then continue processing records from the live source. I'd let the job catch up to current time, atomically switch to it as the "live/current" job, and shut down the previously running one. The issue I'm struggling with is the switch-over from the historical source to the live source. Is there a way to build a composite stream source which would emit records from a bounded data set before consuming from a Kafka topic, for example? Or would I be better off stopping the job once it has read through the historical set, switching its source to the live topic, and restarting it? Some of our jobs rely on rolling fold state, so I think I'd need to resume from a savepoint taken after the historical processing.
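To make the idea concrete, here's a rough sketch of the kind of composite source I'm imagining (class and parameter names are made up; it wraps a plain KafkaConsumer inside a Flink SourceFunction rather than using the FlinkKafkaConsumer, which would lose Flink's Kafka offset checkpointing, so it's probably not the full answer on its own):

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical composite source: first emits a bounded set of historical
// records, then switches over to polling a live Kafka topic.
public class HistoricalThenLiveSource implements SourceFunction<String> {

    private final List<String> historicalRecords; // bounded historical data, assumed already loaded
    private final Properties kafkaProps;          // bootstrap.servers, group.id, deserializers, ...
    private final String liveTopic;

    private volatile boolean running = true;

    public HistoricalThenLiveSource(List<String> historicalRecords,
                                    Properties kafkaProps,
                                    String liveTopic) {
        this.historicalRecords = historicalRecords;
        this.kafkaProps = kafkaProps;
        this.liveTopic = liveTopic;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Phase 1: replay the bounded historical data set.
        for (String record : historicalRecords) {
            if (!running) {
                return;
            }
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
            }
        }

        // Phase 2: switch over to the live Kafka topic.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaProps)) {
            consumer.subscribe(Collections.singletonList(liveTopic));
            while (running) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(record.value());
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

I could then add it to the new job version with env.addSource(new HistoricalThenLiveSource(...)), but I'm not sure how something like this interacts with savepoints and exactly-once guarantees, which is part of why I'm asking.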