Hi Ankur,

Thanks for the response. I have a follow-up on options 1 and 2.

Even if I were able to stream the historical data to PubSub and restart
the job, I believe it still would not work, because the watermark would
already be far past the historical event timestamps and the replayed
elements would be dropped as late data. Am I right?
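
(For concreteness, the live job's windowing looks roughly like the sketch
below, written against the Python SDK; the subscription path, window sizes,
and allowed lateness here are simplified placeholders, not the real config.
My understanding is that replayed historical events would lag the watermark
by far more than any reasonable allowed lateness and would simply be
dropped.)

import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark
from apache_beam.transforms.window import SlidingWindows
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as p:
    (p
     # Placeholder subscription, standing in for the live job's source.
     | 'ReadLive' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/events')
     | 'Window' >> beam.WindowInto(
           SlidingWindows(size=3600, period=300),
           trigger=AfterWatermark(),
           accumulation_mode=AccumulationMode.DISCARDING,
           # Replayed historical elements would be behind the watermark by
           # far more than this, so the runner drops them as late data.
           allowed_lateness=Duration(seconds=600)))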

As for option 2, even if I were able to make an incompatible change (by
doing a hard restart), it would still be a challenge because of the
watermarks.
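
If I understand the batch suggestion correctly, it would look roughly like
the sketch below (again Python SDK; the table, column names, and the final
aggregation are placeholders I am assuming, not our actual metric logic).
Since the BigQuery read is bounded, the watermark only advances to the end
of time once all the input has been read, so the old event timestamps
should not be a problem there.

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows, TimestampedValue

def to_event_time(row):
    # Re-stamp each element with its historical event time; 'event_ts' is
    # assumed to be a unix-seconds column in the table.
    return TimestampedValue(row, row['event_ts'])

with beam.Pipeline() as p:
    (p
     | 'ReadHistory' >> beam.io.ReadFromBigQuery(
           query='SELECT * FROM `my-project.my_dataset.events`',
           use_standard_sql=True)
     | 'StampEventTime' >> beam.Map(to_event_time)
     | 'Window' >> beam.WindowInto(SlidingWindows(size=3600, period=300))
     # Placeholder aggregation standing in for the live job's metrics.
     | 'KeyByMetric' >> beam.Map(lambda row: (row['metric_key'], row['value']))
     | 'Sum' >> beam.CombinePerKey(sum))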

Thanks
Kishore



On Tue, Aug 24, 2021 at 8:00 AM Ankur Goenka <[email protected]> wrote:

> I suppose historical data processing will be a one-time activity, so it
> will be best to use a batch job to process the historical data.
> As for the options you mentioned:
> Option 1 is not feasible, as you would have to update the pipeline, and I
> believe the update will not be compatible because of the source change.
> Option 2 also requires changes to the first job, which would have to be
> done via an update or a drain and restart, so it has the same problem
> while being more complicated.
>
> Thanks,
> Ankur
>
> On Tue, Aug 24, 2021 at 7:44 AM KV 59 <[email protected]> wrote:
>
>> Hi,
>>
>> I have a Beam streaming pipeline processing live data from PubSub using
>> sliding windows on event timestamps. I want to recompute the metrics for
>> historical data in BigQuery. What are my options?
>>
>> I have looked at
>> https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
>> and I have a couple of questions
>>
>> 1. Can I use the same instance of the streaming pipeline? I don't think
>> so, as the watermark would be way past the historical event timestamps.
>>
>> 2. Could I possibly split the pipeline and use one branch for historical
>> data and one for the live streaming data?
>>
>> I am trying hard to avoid standing up parallel infrastructure just to
>> process the historical data.
>>
>> Any inputs would be very much appreciated.
>>
>> Thanks
>> Kishore
>>
>
