Hi Kishore,

You may be able to introduce a new pipeline which reads from BQ and publishes to PubSub, as you mentioned. By default, elements read via PubSub will have a timestamp equal to the publish time of the message (established internally by the PubSub service).
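For concreteness, a minimal sketch of such a backfill pipeline using the Java SDK might look like the following. The project, table, topic, and the TableRow-to-payload conversion below are placeholders for illustration, not your actual names:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BackfillToPubsub {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadHistoricalRows",
            // One-off bounded read of the historical data.
            BigQueryIO.readTableRows().from("my-project:my_dataset.historical_events"))
        .apply("SerializeRows",
            // Placeholder serialization; use whatever payload format your
            // streaming job already expects.
            MapElements.into(TypeDescriptors.strings())
                .via((TableRow row) -> row.toString()))
        .apply("PublishToPubsub",
            // The PubSub service stamps each message with its publish time.
            PubsubIO.writeStrings().to("projects/my-project/topics/events"));

    p.run().waitUntilFinish();
  }
}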
Are you using a custom withTimestampAttribute? If not, I believe there should be no issue with watermarks or late data. As per the javadoc for PubSub Read [1]:

> Note that the system can guarantee that no late data will ever be seen when it assigns timestamps by arrival time (i.e. timestampAttribute is not provided).

[1] https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withTimestampAttribute-java.lang.String-

A minimal sketch of the read side is included at the bottom of this message, below the quoted thread.

Hope that helps.

Thanks,
Evan

On Tue, Aug 24, 2021 at 11:16 KV 59 <[email protected]> wrote:

> Hi Ankur,
>
> Thanks for the response. I have a follow-up on options 1 & 2.
>
> If I were able to stream the historical data to PubSub and restart the
> job, I believe it still would not work because of the watermarks, am I
> right?
>
> As for option 2, even if I were able to make an incompatible change (by
> doing a hard restart), it would still be a challenge because of the
> watermarks.
>
> Thanks
> Kishore
>
>
> On Tue, Aug 24, 2021 at 8:00 AM Ankur Goenka <[email protected]> wrote:
>
>> I suppose historical data processing will be a one-time activity, so it
>> will be best to use a batch job to process the historical data.
>> As for the options you mentioned:
>> Option 1 is not feasible, as you would have to update the pipeline, and
>> I believe the update will not be compatible because of the source change.
>> Option 2 also requires changes to the 1st job, which can be done using
>> update or drain-and-restart, so it has the same problem while being more
>> complicated.
>>
>> Thanks,
>> Ankur
>>
>> On Tue, Aug 24, 2021 at 7:44 AM KV 59 <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a Beam streaming pipeline processing live data from PubSub using
>>> sliding windows on event timestamps. I want to recompute the metrics for
>>> historical data in BigQuery. What are my options?
>>>
>>> I have looked at
>>> https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
>>> and I have a couple of questions:
>>>
>>> 1. Can I use the same instance of the streaming pipeline? I don't think
>>> so, as the watermark would be way past the historical event timestamps.
>>>
>>> 2. Could I possibly split the pipeline and use one branch for historical
>>> data and one for the live streaming data?
>>>
>>> I am trying hard not to spin up parallel infrastructure to process
>>> historical data.
>>>
>>> Any inputs would be very much appreciated.
>>>
>>> Thanks
>>> Kishore
>>>
>>
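As referenced above, here is a minimal sketch of the read side, again assuming the Java SDK. Because no withTimestampAttribute is set, timestamps default to the PubSub publish time; the subscription name and window sizes are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class StreamingMetrics {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubsub",
            // No .withTimestampAttribute(...): each element's event timestamp
            // is the message's publish time, so per the javadoc quoted above
            // the watermark will never mark these elements late.
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/events-sub"))
        .apply("SlidingWindows",
            Window.<String>into(SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardMinutes(1))));
    // ... metric computation over the windows goes here ...

    p.run();
  }
}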
