It also makes sense to have separate jobs for reprocessing historical data, because they will have different scale requirements, and batch can parallelize more efficiently.
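For example, the shared logic could live in a single composite PTransform that both jobs apply, so only the read differs. A rough sketch (the String payload type, window sizes, and subscription name are placeholders, not from this thread):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Shared metric computation; the streaming job and the batch backfill
// each apply this to their own input PCollection.
public class ComputeMetrics
    extends PTransform<PCollection<String>, PCollection<Long>> {

  @Override
  public PCollection<Long> expand(PCollection<String> events) {
    return events
        .apply("SlidingWindows", Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardMinutes(1))))
        .apply("CountPerWindow", Count.<String>globally().withoutDefaults());
  }
}

// In the streaming job's main(): no withTimestampAttribute, so elements
// carry the PubSub publish time and no late data is possible (see the
// javadoc Evan cites below).
p.apply("ReadLive",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
    .apply("Metrics", new ComputeMetrics());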
On Wed, 25 Aug 2021, 09:02 , <[email protected]> wrote:

> Add an option, and make the main IO suitable for Pub/Sub or BQ.
>
> You can't reuse the same running streaming job, for the watermark reason
> you mentioned.
>
> You should be able to keep the entire pipeline identical (basically in a
> single composite PTransform) except for the input PCollection.
>
> On Wed, 25 Aug 2021, 08:24 Evan Galpin, <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> You may be able to introduce a new pipeline which reads from BQ and
>> publishes to PubSub like you mentioned. By default, elements read via
>> PubSub will have a timestamp equal to the publish time of the message
>> (established internally by the PubSub service).
>>
>> Are you using a custom withTimestampAttribute? If not, I believe there
>> should be no issue with watermarks and late data. As per the javadoc for
>> PubSub Read [1]:
>>
>> > Note that the system can guarantee that no late data will ever be
>> > seen when it assigns timestamps by arrival time (i.e.
>> > timestampAttribute is not provided).
>>
>> [1]
>> https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withTimestampAttribute-java.lang.String-
>>
>> Hope that helps.
>>
>> Thanks,
>> Evan
>>
>> On Tue, Aug 24, 2021 at 11:16 KV 59 <[email protected]> wrote:
>>
>>> Hi Ankur,
>>>
>>> Thanks for the response. I have a follow-up to options 1 and 2.
>>>
>>> If I were able to stream the historical data to PubSub and restart the
>>> job, even then I believe it would not work because of the watermarks,
>>> am I right?
>>>
>>> As for option 2, even if I were able to make an incompatible change
>>> (by doing a hard restart), it would still be a challenge because of
>>> the watermarks.
>>>
>>> Thanks
>>> Kishore
>>>
>>> On Tue, Aug 24, 2021 at 8:00 AM Ankur Goenka <[email protected]> wrote:
>>>
>>>> Historical data processing will presumably be a one-time activity, so
>>>> it will be best to use a batch job to process the historical data.
>>>> As for the options you mentioned:
>>>> Option 1 is not feasible, as you would have to update the pipeline,
>>>> and I believe the update would not be compatible because of the
>>>> source change.
>>>> Option 2 also requires changes to the first job, which can be done
>>>> using update, or drain and restart, so it has the same problem while
>>>> being more complicated.
>>>>
>>>> Thanks,
>>>> Ankur
>>>>
>>>> On Tue, Aug 24, 2021 at 7:44 AM KV 59 <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a Beam streaming pipeline processing live data from PubSub
>>>>> using sliding windows on event timestamps. I want to recompute the
>>>>> metrics for historical data in BigQuery. What are my options?
>>>>>
>>>>> I have looked at
>>>>> https://stackoverflow.com/questions/56702878/how-to-use-apache-beam-to-process-historic-time-series-data
>>>>> and I have a couple of questions:
>>>>>
>>>>> 1. Can I use the same instance of the streaming pipeline? I don't
>>>>> think so, as the watermark would be way past the historical event
>>>>> timestamps.
>>>>>
>>>>> 2. Could I possibly split the pipeline and use one branch for
>>>>> historical data and one for the live streaming data?
>>>>>
>>>>> I am trying hard not to stand up parallel infrastructure to process
>>>>> historical data.
>>>>>
>>>>> Any inputs would be very much appreciated.
>>>>>
>>>>> Thanks
>>>>> Kishore
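And for the batch backfill side, something like the below could read the historical rows from BQ and re-assign event timestamps before applying the same composite transform, so the sliding windows fire on event time rather than read time. The table and column names (payload, and event_ts as epoch millis) are assumptions for illustration:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.joda.time.Instant;

public class BackfillFromBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply("ReadHistorical", BigQueryIO.readTableRows()
            .fromQuery("SELECT payload, event_ts FROM `project.dataset.events`")
            .usingStandardSql())
        .apply("AssignEventTime", ParDo.of(new DoFn<TableRow, String>() {
          @ProcessElement
          public void process(@Element TableRow row,
                              OutputReceiver<String> out) {
            // event_ts is assumed to hold the event time as epoch millis.
            // Timestamps only move forward from the bounded source's
            // default, so no allowed-skew override is needed.
            long eventMillis =
                Long.parseLong(String.valueOf(row.get("event_ts")));
            out.outputWithTimestamp(
                (String) row.get("payload"), new Instant(eventMillis));
          }
        }))
        .apply("Metrics", new ComputeMetrics()); // same transform as streaming
    p.run().waitUntilFinish();
  }
}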
