Hi Kishore,

You may be able to introduce a new pipeline which reads from BQ and
publishes to PubSub, as you mentioned. By default, elements read via
PubSub will have a timestamp equal to the publish time of the message
(established internally by the PubSub service).
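
A rough sketch of such a backfill job is below (the project, dataset,
table, and topic names are made up, and row.toString() is just standing
in for whatever serialization your streaming job actually expects):

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class BackfillToPubsub {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadFromBQ",
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.historical_events"))
        // Convert each TableRow into a PubsubMessage payload.
        .apply("ToPubsubMessage",
            MapElements.into(TypeDescriptor.of(PubsubMessage.class))
                .via(row -> new PubsubMessage(
                    row.toString().getBytes(StandardCharsets.UTF_8),
                    Collections.emptyMap())))
        .setCoder(PubsubMessageWithAttributesCoder.of())
        // Publish; the PubSub service stamps each message with its
        // publish time, which becomes the element timestamp downstream.
        .apply("PublishToPubsub",
            PubsubIO.writeMessages()
                .to("projects/my-project/topics/historical-events"));

    p.run().waitUntilFinish();
  }
}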

Are you using a custom withTimestampAttribute? If not, I believe there
should be no issue with watermarks and late data. Per the javadoc for
PubsubIO.Read [1]:

> Note that the system can guarantee that no late data will ever be seen
> when it assigns timestamps by arrival time (i.e. timestampAttribute is
> not provided).

[1]
https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withTimestampAttribute-java.lang.String-
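
In other words, as long as the replay pipeline reads something like the
fragment below (subscription name made up), the assigned timestamps track
arrival time and the replayed data should never fall behind the watermark:

// Without withTimestampAttribute, each element's timestamp is the
// PubSub publish time, so replayed historical messages are never late.
PCollection<PubsubMessage> messages =
    p.apply("ReadFromPubsub",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription(
                "projects/my-project/subscriptions/historical-events-sub"));

// With .withTimestampAttribute("ts"), timestamps would instead come from
// a message attribute, and replayed historical events could fall behind
// the watermark and be dropped as late data.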

Hope that helps.

Thanks,
Evan

On Tue, Aug 24, 2021 at 11:16 KV 59 <[email protected]> wrote:

> Hi Ankur,
>
> Thanks for the response. I have a follow-up to options 1 & 2.
>
> Even if I were able to stream the historical data to PubSub and restart
> the job, I believe it would still not work because of the watermarks, am
> I right?
>
> As for option 2, even if I were able to make an incompatible change (by
> doing a hard restart), it would still be a challenge because of the
> watermarks.
>
> Thanks
> Kishore
>
>
>
> On Tue, Aug 24, 2021 at 8:00 AM Ankur Goenka <[email protected]> wrote:
>
>> I suppose historical data processing will be a one-time activity, so it
>> will be best to use a batch job to process the historical data.
>> As for the options you mentioned:
>> Option 1 is not feasible, as you would have to update the pipeline, and
>> I believe the update would not be compatible because of the source change.
>> Option 2 also requires changes to the first job, which can be done via an
>> update or a drain-and-restart, so it has the same problem while being
>> more complicated.
>>
>> Thanks,
>> Ankur
>>
>> On Tue, Aug 24, 2021 at 7:44 AM KV 59 <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a Beam streaming pipeline processing live data from PubSub using
>>> sliding windows on event timestamps. I want to recompute the metrics for
>>> historical data in BigQuery. What are my options?
>>>
>>> I have looked at
>>> https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
>>> and I have a couple of questions
>>>
>>> 1. Can I use the same instance of the streaming pipeline? I don't think
>>> so, as the watermark would be way past the historical event timestamps.
>>>
>>> 2. Could I possibly split the pipeline and use one branch for historical
>>> data and one for the live streaming data?
>>>
>>> I am trying hard not to stand up parallel infrastructure to process the
>>> historical data.
>>>
>>> Any inputs would be very much appreciated.
>>>
>>> Thanks
>>> Kishore
>>>
>>
