Hi,

Is there a preferred way of approaching the reprocessing of historic data with streaming jobs?
I want to pose this as a general question, but I'm working with Pubsub and Dataflow specifically. I'm a fan of the idea of replaying/fast-forwarding through historic data to reproduce results (as you might with Kafka), but I'm having a hard time reconciling that way of thinking with the concepts of watermarks and late data in Beam. I'm not sure how to best mimic this with the tools I'm using, or whether there is a better way. If there is a previous discussion about this that I might have missed (and I'm guessing there is), please direct me to it!

The use case: suppose I discover a bug in a streaming job with event-time windows and an allowed lateness of 7 days, and that I subsequently have to reprocess all the data for the past month. Let us also assume that I have an archive of my source data (in my case in Google Cloud Storage) and that I can republish it all to the message queue I'm using. (A sketch of roughly the kind of pipeline I mean is in the P.S. below.)

Some ideas that may or may not work, which I would love to get your thoughts on:

1) Start a new instance of the job that reads from a separate source, to which I republish all messages. This shouldn't work, because 14 days of my data is later than the allowed lateness limit, but the remaining 7 days should be reprocessed as intended.

2) The same as 1), but with an allowed lateness of one month. When the job has caught up, the lateness can be adjusted back to 7 days. I'm afraid this approach may consume too much memory, since I'm letting a whole month of windows remain in memory. I also wouldn't get the same triggering behaviour as in the original job, since most or all of the data is late with respect to the watermark, which I assume is near real time when the historic data enters the pipeline.

3) The same as 1), but with the republishing done first, only starting the new job once all messages are already waiting in the queue. The watermark should then start one month back in time and only catch up with the present once all the data has been reprocessed, yielding no late data. (Experiments I've done with this approach produce somewhat unexpected results, where early panes that are older than 7 days appear to be both the first and the last firing from their respective windows.) Early firings triggered by processing time would probably differ, but the results should be the same? This approach also feels a bit awkward, as it requires more orchestration.

4) Batch process the archived data instead and start a streaming job in parallel. Would this in a sense be a more honest approach, since I'm actually reprocessing batches of archived data? The triggering behaviour of the streaming version of the job would not apply in batch, and I would want to avoid stitching together results from two jobs if I can. (A rough sketch of what I mean is in the P.P.S.)

These are the approaches I've thought of so far, and any input is much appreciated. Have any of you faced similar situations, and how did you solve them?

Regards,
Lars
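
P.S. For concreteness, here is a minimal sketch of the kind of streaming job I'm describing. The project/subscription names, the timestamp attribute, the hourly window size, and the one-minute early-firing delay are made up for illustration; the parts that matter for my question are the event-time windowing and the 7-day allowed lateness.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ReprocessingSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read from Pub/Sub; the subscription and the attribute carrying event time
    // are placeholders.
    PCollection<String> events = p.apply(
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/events-sub")
            .withTimestampAttribute("event_time"));

    // Hourly event-time windows with processing-time early firings and an
    // allowed lateness of 7 days.
    PCollection<KV<String, Long>> counts = events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.standardDays(7))
            .accumulatingFiredPanes())
        .apply(Count.perElement());

    p.run();
  }
}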

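P.P.S. And a rough sketch of what I had in mind for option 4, assuming the archive is newline-delimited text in GCS and that the event-time timestamp can be re-attached from the payload (the bucket path is illustrative, and extractEventTime is a hypothetical function that parses an org.joda.time.Instant out of a line). It would slot in in place of the PubsubIO read in the sketch above, with the same windowing and aggregation applied afterwards:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.WithTimestamps;

// Read the archived messages from GCS instead of Pub/Sub and re-attach
// event-time timestamps before the windowing/aggregation steps.
PCollection<String> archived = p
    .apply(TextIO.read().from("gs://my-archive-bucket/events/*.json"))
    .apply(WithTimestamps.of((String line) -> extractEventTime(line)));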