Hi,

Is there a preferred way of approaching reprocessing historic data with
streaming jobs?

I want to pose this as a general question, but I'm working with Cloud
Pub/Sub and Dataflow specifically. I like the idea of
replaying/fast-forwarding through historic data to reproduce results
(as you perhaps would with Kafka), but I'm having a hard time
reconciling this way of thinking with the concepts of watermarks and
late data in Beam. I'm not sure how best to mimic this with the tools
I'm using, or whether there is a better way.

If there is a previous discussion about this that I have missed (and
I'm guessing there is), please point me to it!


The use case:

Suppose I discover a bug in a streaming job with event-time windows
and an allowed lateness of 7 days, and that I subsequently have to
reprocess all the data for the past month. Let us also assume that I
have an archive of my source data (in my case in Google Cloud Storage)
and that I can republish it all to the message queue I'm using.
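
For concreteness, the windowing in the job is roughly of this shape
(Java SDK; the window size, trigger and names below are illustrative
rather than my exact code):

    import org.joda.time.Duration;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;

    // Hourly event-time windows with speculative early firings and
    // late firings for data up to 7 days behind the watermark.
    PCollection<String> windowed = events.apply(
        Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(7))
            .accumulatingFiredPanes());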

Here are some ideas that may or may not work; I would love to get your
thoughts on them:

1) Start a new instance of the job that reads from a separate source
to which I republish all messages. This shouldn't fully work, because
everything older than the 7-day allowed lateness will be dropped as
too late, but the most recent 7 days should be reprocessed as
intended.
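
For reference, by "a separate source" I mean something like a
dedicated replay subscription that the new job instance reads from
(project and subscription names are placeholders):

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;

    // New job instance reading the republished messages from a
    // dedicated replay subscription instead of the production one.
    PCollection<String> events = pipeline.apply(
        PubsubIO.readStrings().fromSubscription(
            "projects/my-project/subscriptions/replay-sub"));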

2) The same as 1), but with an allowed lateness of one month. When the
job has caught up, the lateness can be adjusted back to 7 days. I am
afraid this approach may consume too much memory, since state for a
whole month of windows has to be kept around. Also, I wouldn't get the
same triggering behaviour as in the original job, since most or all of
the data is late with respect to the watermark, which I assume stays
near real time when the historic data enters the pipeline.
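
Concretely, the only change to the windowing sketched above would be
something like this:

    // Keep windows open for late data for a whole month while the
    // backlog is replayed; state for all of those windows is retained.
    .withAllowedLateness(Duration.standardDays(30))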

3) The same as 1), but doing the republishing first and only starting
the new job once all messages are already waiting in the queue. The
watermark
should then start one month back in time and only catch up with the present
once all the data is reprocessed, yielding no late data. (Experiments I've
done with this approach produce somewhat unexpected results where early
panes that are older than 7 days appear to be both the first and the last
firing from their respective windows.) Early firings triggered by
processing time would probably differ, but the final results should be
the same?
This approach also feels a bit awkward as it requires more orchestration.
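
For this to behave as described, I assume the replay read needs a
timestamp attribute so that the watermark follows the event times of
the republished backlog rather than the publish times, i.e. something
like this (the attribute name is a placeholder):

    // Same read as in 1), but timestamps and the watermark are now
    // derived from a per-message event-time attribute, so the
    // watermark can start a month back and only reach the present
    // once the backlog has drained.
    PCollection<String> events = pipeline.apply(
        PubsubIO.readStrings()
            .fromSubscription(
                "projects/my-project/subscriptions/replay-sub")
            .withTimestampAttribute("eventTimestamp"));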

4) Batch process the archived data instead and start a streaming job in
parallel. Would this in a sense be a more honest approach since I'm
actually reprocessing batches of archived data? The triggering behaviour in
the streaming version of the job would not apply in batch, and I would want
to avoid stitching together results from two jobs if I can.
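
The batch variant I have in mind would read the archive from GCS,
re-attach event timestamps, and reuse the same windowing transform,
roughly like this (the path and the timestamp parsing are
placeholders):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.WithTimestamps;

    // Batch over the archived files; event timestamps are re-attached
    // from the records themselves so the event-time windows line up
    // with the streaming job.
    PCollection<String> archived = pipeline.apply(
        TextIO.read().from("gs://my-bucket/archive/2018/*/*.json"));
    PCollection<String> stamped = archived.apply(
        WithTimestamps.of((String record) -> parseEventTime(record)));
    // ...then the same Window.into(...) and downstream aggregation
    // as in the streaming job.

Here parseEventTime() is just a stand-in for however the event time is
read out of each record.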


These are the approaches I've thought of so far, and any input is much
appreciated. Have any of you faced similar situations, and how did you
solve them?


Regards,
Lars
