It also makes sense to have separate jobs for reprocessing historical
data, because they will have different scale requirements, and batch can
parallelize more efficiently.
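
For illustration, a minimal sketch (Java SDK) of the composite-PTransform
approach suggested below. The transform body, the "payload" field, the
extractEventTime helper, and the pipeline/source variables are all
hypothetical placeholders for your actual parsing and metric logic:

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.WithTimestamps;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    // All shared logic lives in one composite transform, so only the
    // source differs between the streaming job and the batch backfill.
    class ComputeMetrics
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
      @Override
      public PCollection<KV<String, Long>> expand(PCollection<String> events) {
        return events
            .apply(Window.into(SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardMinutes(1))))
            .apply(Count.perElement()); // stand-in for the real aggregation
      }
    }

    // Streaming job: element timestamps default to Pub/Sub publish time.
    streamingPipeline
        .apply(PubsubIO.readStrings().fromSubscription(subscription))
        .apply(new ComputeMetrics());

    // One-off batch backfill: read history from BigQuery, assign event
    // timestamps from each row, then apply the identical transform.
    batchPipeline
        .apply(BigQueryIO.readTableRows().from(historyTable))
        .apply(MapElements.into(TypeDescriptors.strings())
            .via(row -> (String) row.get("payload"))) // hypothetical field
        .apply(WithTimestamps.of(e -> extractEventTime(e))) // hypothetical helper
        .apply(new ComputeMetrics());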

On Wed, 25 Aug 2021, 09:02, <[email protected]> wrote:

> Have an option, and make the main IO switchable between Pub/Sub and BQ.
>
> You can't use the same running streaming job for the watermark reason you
> mentioned.
>
> You should be able to keep the entire pipeline identical (basically in a
> single composite PTransform) except for the input PCollection.
>
> On Wed, 25 Aug 2021, 08:24 Evan Galpin, <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> You may be able to introduce a new pipeline which reads from BQ and
>> publishes to PubSub, as you mentioned. By default, elements read via
>> PubSub will have a timestamp equal to the publish time of the message
>> (internally established by the PubSub service).
>>
>> Are you using a custom withTimestampAttribute? If not, I believe there
>> should be no issue with watermarks or late data. As per the javadoc for
>> PubsubIO.Read [1]:
>>
>> > Note that the system can guarantee that no late data will ever be seen
>> when it assigns timestamps by arrival time (i.e. timestampAttribute is
>> not provided).
>>
>> [1]
>>
>> https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withTimestampAttribute-java.lang.String-
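>>
>> For example, a minimal sketch of the difference ("eventTimeMillis" is a
>> hypothetical attribute name, and subscription is a placeholder):
>>
>>     // Default: timestamps are assigned from Pub/Sub publish time, so
>>     // the watermark is exact and no late data will be seen.
>>     PubsubIO.readStrings().fromSubscription(subscription);
>>
>>     // Custom attribute: timestamps come from the named message
>>     // attribute, so the watermark is estimated and late data is
>>     // possible.
>>     PubsubIO.readStrings()
>>         .fromSubscription(subscription)
>>         .withTimestampAttribute("eventTimeMillis");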
>>
>> Hope that helps.
>>
>> Thanks,
>> Evan
>>
>> On Tue, Aug 24, 2021 at 11:16 KV 59 <[email protected]> wrote:
>>
>>> Hi Ankur,
>>>
>>> Thanks for the response. I have a follow-up to options 1 & 2.
>>>
>>> Even if I were able to stream the historical data to PubSub and restart
>>> the job, I believe it would still not work because of the watermarks, am
>>> I right?
>>>
>>> As for option 2, even if I were able to make an incompatible change (by
>>> doing a hard restart), it would still be a challenge because of the
>>> watermarks.
>>>
>>> Thanks
>>> Kishore
>>>
>>>
>>>
>>> On Tue, Aug 24, 2021 at 8:00 AM Ankur Goenka <[email protected]> wrote:
>>>
>>>> I suppose historical data processing will be a one-time activity, so it
>>>> will be best to use a batch job to process the historical data.
>>>> As for the options you mentioned,
>>>> Option 1 is not feasible, as you will have to update the pipeline, and I
>>>> believe the update will not be compatible because of the source change.
>>>> Option 2 also requires changes to the 1st job, which can be done via an
>>>> update or a drain and restart, so it has the same problem while being
>>>> more complicated.
>>>>
>>>> Thanks,
>>>> Ankur
>>>>
>>>> On Tue, Aug 24, 2021 at 7:44 AM KV 59 <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a Beam streaming pipeline processing live data from PubSub
>>>>> using sliding windows on event timestamps. I want to recompute the metrics
>>>>> for historical data in BigQuery. What are my options?
>>>>>
>>>>> I have looked at
>>>>> https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
>>>>> and I have a couple of questions
>>>>>
>>>>> 1. Can I use the same instance of the streaming pipeline? I don't
>>>>> think so, as the watermark would be way past the historical event
>>>>> timestamps.
>>>>>
>>>>> 2. Could I possibly split the pipeline and use one branch for
>>>>> historical data and one for the live streaming data?
>>>>>
>>>>> I am trying hard to avoid standing up parallel infrastructure to
>>>>> process the historical data.
>>>>>
>>>>> Any inputs would be very much appreciated.
>>>>>
>>>>> Thanks
>>>>> Kishore
>>>>>
>>>>
