Hi Lars,

Interesting use case indeed ;)

Just to understand: if possible, you don't want to re-consume the messages from the PubSub topic, right? So you want to "hold" the PCollections for late data processing?

Regards
JB

On 05/01/2017 04:15 PM, Lars BK wrote:
Hi,

Is there a preferred way to approach reprocessing historic data with
streaming jobs?

I want to pose this as a general question, but I'm working with Pubsub and
Dataflow specifically. I am a fan of the idea of replaying/fast-forwarding
through historic data to reproduce results (as you perhaps would with Kafka),
but I'm having a hard time unifying this way of thinking with the concepts of
watermarks and late data in Beam. I'm not sure how best to mimic this with the
tools I'm using, or if there is a better way.

If there is a previous discussion about this I might have missed (and I'm
guessing there is), please direct me to it!


The use case:

Suppose I discover a bug in a streaming job with event time windows and an
allowed lateness of 7 days, and that I subsequently have to reprocess all the
data for the past month. Let us also assume that I have an archive of my source
data (in my case in Google Cloud Storage) and that I can republish it all to the
message queue I'm using.
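
For concreteness, the windowing setup I have in mind looks roughly like the
sketch below. The topic, attribute name, window size and triggers are all
made up for illustration; the 7 days of allowed lateness is the point.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class MyStreamingJob {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Event time comes from a message attribute instead of publish time
        // ("eventTimestamp" is a made-up attribute name).
        PCollection<String> events = p.apply(PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/events")
            .withTimestampAttribute("eventTimestamp"));

        // Hourly event-time windows with early firings, late firings for
        // late data, and the 7 days of allowed lateness I mentioned.
        events.apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(5)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(7))
            .accumulatingFiredPanes());

        p.run();
      }
    }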

Here are some ideas that may or may not work; I would love to get your thoughts on them:

1) Start a new instance of the job that reads from a separate source to which I
republish all messages. This shouldn't work in full, since 14 days of my data
would be later than the allowed lateness, but the remaining 7 days should be
reprocessed as intended.

2) The same as 1), but with an allowed lateness of one month. When the job has
caught up, the lateness can be adjusted back to 7 days. I am afraid this
approach may consume too much memory, since I'm letting a whole month of windows
remain in memory. Also, I wouldn't get the same triggering behaviour as in the
original job since most or all of the data is late with respect to the
watermark, which I assume is near real time when the historic data enters the
pipeline.
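
In code, the replay run would only differ in the lateness bound, something
like:

    // Replay run: widen the lateness bound so the replayed month isn't
    // dropped (was standardDays(7) in the original job).
    .withAllowedLateness(Duration.standardDays(31))

Once caught up, I could try shrinking it back to 7 days by redeploying with
Dataflow's --update flag, though I don't know whether a windowing change like
this is accepted as update-compatible.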

3) The same as 1), but with the republishing first and only starting the new job
when all messages are already waiting in the queue. The watermark should then
start one month back in time and only catch up with the present once all the
data is reprocessed, yielding no late data. (Experiments I've done with this
approach produce somewhat unexpected results where early panes that are older
than 7 days appear to be both the first and the last firing from their
respective windows.) Early firings triggered by processing time would probably
differ, but the results should be the same? This approach also feels a bit awkward
as it requires more orchestration.
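
The republishing step I experimented with is itself a small batch pipeline:
read the archive, restore each record's event-time timestamp, and write to the
replay topic with a timestamp attribute, so that the consuming job derives its
watermark from event time rather than publish time. The paths, topic, and the
extractEventTime helper below are all made up:

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.WithTimestamps;
    import org.joda.time.Instant;

    p.apply(TextIO.read().from("gs://my-archive/2017-04/*.json"))
     // extractEventTime is a made-up helper that parses the event time
     // (millis since epoch) out of an archived record.
     .apply(WithTimestamps.of((String json) -> new Instant(extractEventTime(json))))
     // PubsubIO writes each element's timestamp into the attribute, so the
     // streaming job can read event time (and its watermark) from it.
     .apply(PubsubIO.writeStrings()
         .to("projects/my-project/topics/replay")
         .withTimestampAttribute("eventTimestamp"));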

4) Batch process the archived data instead and start a streaming job in
parallel. Would this in a sense be a more honest approach since I'm actually
reprocessing batches of archived data? The triggering behaviour in the streaming
version of the job would not apply in batch, and I would want to avoid stitching
together results from two jobs if I can.
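
As a sketch, reusing the made-up names from above; in a bounded pipeline the
triggers are largely moot, since each window fires once when its input is
complete:

    // Batch variant: read the archive directly instead of from Pubsub.
    p.apply(TextIO.read().from("gs://my-archive/2017-04/*.json"))
     .apply(WithTimestamps.of((String json) -> new Instant(extractEventTime(json))))
     .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
     // ... followed by the same aggregation as in the streaming job ...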


These are the approaches I've thought of so far, and any input is much
appreciated. Have any of you faced similar situations, and how did you solve
them?


Regards,
Lars



--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
