I did not see Lukasz's reply before I posted, and I will have to read it a bit later!
man. 1. mai 2017 kl. 18.28 skrev Lars BK <[email protected]>:

> Yes, precisely.
>
> I think that could work, yes. What you are suggesting sounds like idea 2)
> in my original question.
>
> My main concern is that I would have to allow a great deal of lateness and
> that old windows would consume too much memory. Whether it works in my
> case or not I don't know yet, as I haven't tested it.
>
> What if I had to process even older data? Could I handle any "oldness" of
> data by increasing the allowed lateness and throwing machines at the
> problem to hold all the old windows in memory while the backlog is
> processed? If so, great! But I would have to dial the allowed lateness
> back down when the processing has caught up with the present.
>
> Is there some intended way of handling reprocessing like this? Maybe not?
> Perhaps it is more of a Pubsub and Dataflow question than a Beam question
> when it comes down to it.
>
>
> man. 1. mai 2017 kl. 17.25 skrev Jean-Baptiste Onofré <[email protected]>:
>
>> OK, so the messages are re-published on the topic, with the same
>> timestamp as the original, and consumed again by the pipeline.
>>
>> Maybe you can play with the allowed lateness and late firings?
>>
>> Something like:
>>
>> Window.into(FixedWindows.of(Duration.standardMinutes(xx)))
>>     .triggering(AfterWatermark.pastEndOfWindow()
>>         .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
>>             .plusDelayOf(FIVE_MINUTES))
>>         .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
>>             .plusDelayOf(TEN_MINUTES)))
>>     .withAllowedLateness(Duration.standardMinutes(xx))
>>     .accumulatingFiredPanes()
>>
>> Thoughts?
>>
>> Regards
>> JB
>>
>> On 05/01/2017 05:12 PM, Lars BK wrote:
>> > Hi Jean-Baptiste,
>> >
>> > I think the key point in my case is that I have to process or reprocess
>> > "old" messages. That is, messages that are late because they are
>> > streamed from an archive file and are older than the allowed lateness
>> > in the pipeline.
>> >
>> > In the case I described, the messages had already been processed once
>> > and were no longer in the topic, so they had to be sent and processed
>> > again. But it might as well have been that I had received a backfill of
>> > data that absolutely needs to be processed regardless of it being later
>> > than the allowed lateness with respect to present time.
>> >
>> > So when I write this now, it really sounds like I either need to allow
>> > more lateness or somehow rewind the watermark!
>> >
>> > Lars
>> >
>> > man. 1. mai 2017 kl. 16.34 skrev Jean-Baptiste Onofré <[email protected]
>> > <mailto:[email protected]>>:
>> >
>> > Hi Lars,
>> >
>> > interesting use case indeed ;)
>> >
>> > Just to understand: if possible, you don't want to re-consume the
>> > messages from the PubSub topic, right? So you want to "hold" the
>> > PCollections for late data processing?
>> >
>> > Regards
>> > JB
>> >
>> > On 05/01/2017 04:15 PM, Lars BK wrote:
>> > > Hi,
>> > >
>> > > Is there a preferred way of approaching reprocessing of historic data
>> > > with streaming jobs?
>> > >
>> > > I want to pose this as a general question, but I'm working with
>> > > Pubsub and Dataflow specifically. I am a fan of the idea of
>> > > replaying/fast-forwarding through historic data to reproduce results
>> > > (as you perhaps would with Kafka), but I'm having a hard time
>> > > unifying this way of thinking with the concepts of watermarks and
>> > > late data in Beam. I'm not sure how best to mimic this with the tools
>> > > I'm using, or if there is a better way.
>> > >
>> > > If there is a previous discussion about this that I might have missed
>> > > (and I'm guessing there is), please direct me to it!
>> > >
>> > >
>> > > The use case:
>> > >
>> > > Suppose I discover a bug in a streaming job with event time windows
>> > > and an allowed lateness of 7 days, and that I subsequently have to
>> > > reprocess all the data for the past month. Let us also assume that I
>> > > have an archive of my source data (in my case in Google Cloud
>> > > Storage) and that I can republish it all to the message queue I'm
>> > > using.
>> > >
>> > > Some ideas that may or may not work that I would love to get your
>> > > thoughts on:
>> > >
>> > > 1) Start a new instance of the job that reads from a separate source
>> > > to which I republish all messages. This shouldn't work because 14
>> > > days of my data is later than the allowed limit, but the remaining 7
>> > > days should be reprocessed as intended.
>> > >
>> > > 2) The same as 1), but with an allowed lateness of one month. When
>> > > the job has caught up, the lateness can be adjusted back to 7 days. I
>> > > am afraid this approach may consume too much memory since I'm letting
>> > > a whole month of windows remain in memory. Also, I wouldn't get the
>> > > same triggering behaviour as in the original job since most or all of
>> > > the data is late with respect to the watermark, which I assume is
>> > > near real time when the historic data enters the pipeline.
>> > >
>> > > 3) The same as 1), but with the republishing done first and the new
>> > > job only started once all messages are already waiting in the queue.
>> > > The watermark should then start one month back in time and only catch
>> > > up with the present once all the data is reprocessed, yielding no
>> > > late data. (Experiments I've done with this approach produce somewhat
>> > > unexpected results where early panes that are older than 7 days
>> > > appear to be both the first and the last firing from their respective
>> > > windows.) Early firings triggered by processing time would probably
>> > > differ, but the results should be the same? This approach also feels
>> > > a bit awkward as it requires more orchestration.
>> > >
>> > > 4) Batch process the archived data instead and start a streaming job
>> > > in parallel. Would this in a sense be a more honest approach, since
>> > > I'm actually reprocessing batches of archived data? The triggering
>> > > behaviour in the streaming version of the job would not apply in
>> > > batch, and I would want to avoid stitching together results from two
>> > > jobs if I can.
>> > >
>> > >
>> > > These are the approaches I've thought of so far, and any input is
>> > > much appreciated. Have any of you faced similar situations, and how
>> > > did you solve them?
>> > >
>> > >
>> > > Regards,
>> > > Lars
>> > >
>> > >
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > [email protected] <mailto:[email protected]>
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
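For reference, here is a minimal, self-contained sketch of the kind of pipeline
JB outlines above, assuming a recent Beam 2.x Java SDK, and assuming the
republished messages are read from a Pub/Sub subscription and carry their
original event time in an attribute (the "ts" attribute name, the subscription
path, and all durations below are placeholders, not anything from this thread):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ReplayBackfillSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The watermark is derived from the "ts" attribute on each message, so it
    // starts at the oldest republished message rather than at the present.
    PCollection<String> replayed = p.apply("ReadReplayedMessages",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/replay-sub")
            .withTimestampAttribute("ts"));

    // Same event-time windows and firings as the live job, but with the
    // allowed lateness temporarily widened to cover the whole backfill period.
    replayed.apply("WindowForBackfill",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(10)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(5)))
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10))))
            .withAllowedLateness(Duration.standardDays(31))
            .accumulatingFiredPanes());
    // ...followed by whatever aggregation the real job applies.

    p.run();
  }
}

Deriving the event time (and hence the watermark) from a message attribute is
what lets the watermark start a month back and catch up as the backlog drains,
which is essentially what idea 3) relies on; the widened allowed lateness is
the knob discussed in ideas 1) and 2).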

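And a rough sketch of the batch variant from idea 4), assuming the archive is
newline-delimited records in GCS and that the event time can be parsed out of
each record; the bucket path and the parseEventTime helper are hypothetical
placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class BatchBackfillSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded read of the archived records from GCS, so this runs as a batch job.
    PCollection<String> archived =
        p.apply("ReadArchive", TextIO.read().from("gs://my-bucket/archive/*.json"));

    // Attach each record's original event time; parseEventTime stands in for
    // however the timestamp is actually encoded in the archived record.
    PCollection<String> timestamped = archived.apply("AssignEventTime",
        WithTimestamps.of((String record) -> parseEventTime(record)));

    // The same event-time windows as the streaming job.
    timestamped.apply("WindowLikeStreamingJob",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(10))));
    // ...followed by the same aggregation as the streaming job.

    p.run();
  }

  // Placeholder: extract the event timestamp from an archived record,
  // here assumed to start with an ISO-8601 timestamp.
  private static Instant parseEventTime(String record) {
    return Instant.parse(record.substring(0, 24));
  }
}

In a bounded run the streaming triggers are largely moot: each window emits its
final pane once the bounded input is exhausted, which is why the streaming
job's early and late firing behaviour would not be reproduced in batch.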