It also makes sense to have separate jobs for reprocessing historical data, because they will have different scale requirements, and batch can parallelize more efficiently.
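For example, the shared logic could live in a single composite PTransform that both jobs apply, so only the read differs. A rough sketch (the String payload type, window sizes, and subscription name are placeholders, not from this thread):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Shared metric computation; the streaming job and the batch backfill
// each apply this to their own input PCollection.
public class ComputeMetrics
    extends PTransform<PCollection<String>, PCollection<Long>> {

  @Override
  public PCollection<Long> expand(PCollection<String> events) {
    return events
        .apply("SlidingWindows", Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardMinutes(1))))
        .apply("CountPerWindow", Count.<String>globally().withoutDefaults());
  }
}

// In the streaming job's main(): no withTimestampAttribute, so elements
// carry the PubSub publish time and no late data is possible (see the
// javadoc Evan cites below).
p.apply("ReadLive",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
    .apply("Metrics", new ComputeMetrics());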
On Wed, 25 Aug 2021, 09:02 , <[email protected]> wrote:

> Add an option, and make the main IO suitable for Pub/Sub or BQ.
>
> You can't reuse the same running streaming job, for the watermark reason
> you mentioned.
>
> You should be able to keep the entire pipeline identical (basically in a
> single composite PTransform) except for the input PCollection.
>
> On Wed, 25 Aug 2021, 08:24 Evan Galpin, <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> You may be able to introduce a new pipeline which reads from BQ and
>> publishes to PubSub like you mentioned. By default, elements read via
>> PubSub will have a timestamp equal to the publish time of the message
>> (established internally by the PubSub service).
>>
>> Are you using a custom withTimestampAttribute? If not, I believe there
>> should be no issue with watermarks and late data. As per the javadoc for
>> PubSub Read [1]:
>>
>> > Note that the system can guarantee that no late data will ever be
>> > seen when it assigns timestamps by arrival time (i.e.
>> > timestampAttribute is not provided).
>>
>> [1]
>> https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withTimestampAttribute-java.lang.String-
>>
>> Hope that helps.
>>
>> Thanks,
>> Evan
>>
>> On Tue, Aug 24, 2021 at 11:16 KV 59 <[email protected]> wrote:
>>
>>> Hi Ankur,
>>>
>>> Thanks for the response. I have a follow-up to options 1 and 2.
>>>
>>> If I were able to stream the historical data to PubSub and restart the
>>> job, even then I believe it would not work because of the watermarks,
>>> am I right?
>>>
>>> As for option 2, even if I were able to make an incompatible change
>>> (by doing a hard restart), it would still be a challenge because of
>>> the watermarks.
>>>
>>> Thanks
>>> Kishore
>>>
>>> On Tue, Aug 24, 2021 at 8:00 AM Ankur Goenka <[email protected]> wrote:
>>>
>>>> Historical data processing will presumably be a one-time activity, so
>>>> it will be best to use a batch job to process the historical data.
>>>> As for the options you mentioned:
>>>> Option 1 is not feasible, as you would have to update the pipeline,
>>>> and I believe the update would not be compatible because of the
>>>> source change.
>>>> Option 2 also requires changes to the first job, which can be done
>>>> using update, or drain and restart, so it has the same problem while
>>>> being more complicated.
>>>>
>>>> Thanks,
>>>> Ankur
>>>>
>>>> On Tue, Aug 24, 2021 at 7:44 AM KV 59 <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a Beam streaming pipeline processing live data from PubSub
>>>>> using sliding windows on event timestamps. I want to recompute the
>>>>> metrics for historical data in BigQuery. What are my options?
>>>>>
>>>>> I have looked at
>>>>> https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
>>>>> and I have a couple of questions:
>>>>>
>>>>> 1. Can I use the same instance of the streaming pipeline? I don't
>>>>> think so, as the watermark would be way past the historical event
>>>>> timestamps.
>>>>>
>>>>> 2. Could I possibly split the pipeline and use one branch for
>>>>> historical data and one for the live streaming data?
>>>>>
>>>>> I am trying hard not to stand up parallel infrastructure to process
>>>>> historical data.
>>>>>
>>>>> Any inputs would be very much appreciated.
>>>>>
>>>>> Thanks
>>>>> Kishore
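And for the batch backfill side, something like the below could read the historical rows from BQ and re-assign event timestamps before applying the same composite transform, so the sliding windows fire on event time rather than read time. The table and column names (payload, and event_ts as epoch millis) are assumptions for illustration:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.joda.time.Instant;

public class BackfillFromBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply("ReadHistorical", BigQueryIO.readTableRows()
            .fromQuery("SELECT payload, event_ts FROM `project.dataset.events`")
            .usingStandardSql())
        .apply("AssignEventTime", ParDo.of(new DoFn<TableRow, String>() {
          @ProcessElement
          public void process(@Element TableRow row,
                              OutputReceiver<String> out) {
            // event_ts is assumed to hold the event time as epoch millis.
            // Timestamps only move forward from the bounded source's
            // default, so no allowed-skew override is needed.
            long eventMillis =
                Long.parseLong(String.valueOf(row.get("event_ts")));
            out.outputWithTimestamp(
                (String) row.get("payload"), new Instant(eventMillis));
          }
        }))
        .apply("Metrics", new ComputeMetrics()); // same transform as streaming
    p.run().waitUntilFinish();
  }
}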
