Hi Ankur,

Thanks for the response. I have a follow-up to options 1 & 2.
If I were able to stream the historical data to PubSub and restart the job, I believe it would still not work because of the watermarks. Am I right? As for option 2, even if I were able to make an incompatible change (by doing a hard restart), it would still be a challenge because of the watermarks.

Thanks,
Kishore

On Tue, Aug 24, 2021 at 8:00 AM Ankur Goenka <[email protected]> wrote:

> Historical data processing will be a one-time activity, so it will be
> best to use a batch job to process the historical data.
> As for the options you mentioned:
> Option 1 is not feasible, as you would have to update the pipeline, and I
> believe the update will not be compatible because of the source change.
> Option 2 also requires changes to the 1st job, which can be done with an
> update or a drain-and-restart, so it has the same problem while being
> more complicated.
>
> Thanks,
> Ankur
>
> On Tue, Aug 24, 2021 at 7:44 AM KV 59 <[email protected]> wrote:
>
>> Hi,
>>
>> I have a Beam streaming pipeline processing live data from PubSub using
>> sliding windows on event timestamps. I want to recompute the metrics for
>> historical data in BigQuery. What are my options?
>>
>> I have looked at
>> https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
>> and I have a couple of questions:
>>
>> 1. Can I use the same instance of the streaming pipeline? I don't think
>> so, as the watermark would be way past the historical event timestamps.
>>
>> 2. Could I possibly split the pipeline and use one branch for the
>> historical data and one for the live streaming data?
>>
>> I am trying hard not to stand up parallel infrastructure to process the
>> historical data.
>>
>> Any inputs would be very much appreciated.
>>
>> Thanks,
>> Kishore
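To make the watermark concern in this thread concrete, here is a simplified, self-contained Python model (not actual Beam code; the window alignment and lateness handling are simplified, and the timestamps and window parameters are made up for illustration). It shows why elements replayed into the running streaming job would be dropped once the watermark has passed their windows' end plus the allowed lateness, while a batch run over bounded historical data has no such problem because its watermark is derived from the bounded input itself:

```python
def assign_sliding_windows(ts, size, period):
    """Window (start, end) pairs containing ts, for epoch-aligned sliding
    windows of length `size` starting every `period` (all in seconds)."""
    last_start = ts - (ts % period)
    starts = []
    start = last_start
    while start > ts - size:
        starts.append(start)
        start -= period
    return [(s, s + size) for s in sorted(starts)]

def dropped_by_streaming_job(event_ts, watermark, allowed_lateness, size, period):
    """True if every window this element belongs to has already closed,
    i.e. the watermark is past window_end + allowed_lateness for all of
    them, so the element would be discarded as droppably late."""
    return all(end + allowed_lateness < watermark
               for _, end in assign_sliding_windows(event_ts, size, period))

# Hypothetical setup: 1-hour sliding windows every 10 minutes, 5 minutes
# of allowed lateness, watermark at "now" (epoch seconds).
now = 1_700_000_000
print(dropped_by_streaming_job(now - 10, now, 300, 3600, 600))         # False: live event kept
print(dropped_by_streaming_job(now - 7 * 86400, now, 300, 3600, 600))  # True: week-old event dropped
```

This is why replaying BigQuery history into the live PubSub job fails regardless of how the data gets there: the problem is the watermark position, not the transport. A separate batch pipeline over the bounded historical data reuses the same windowing logic without the late-data issue.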
