Cloud Storage subscriptions are a reasonable way to back up data, and you can then run a batch pipeline over the resulting GCS files. Keep in mind that these files might contain duplicates (Cloud Storage subscriptions do not guarantee exactly-once writes). If that's a problem, add a deduplication stage to the batch job that processes these files.
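
For illustration, here's a minimal sketch of such a batch job in Java. It assumes the subscription exports Avro with the write-metadata option enabled (so each record carries a message_id); the bucket path is a placeholder and the reader schema is abbreviated, so take the exact record name and field list from the second link Alex posted below.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
// from beam-sdks-java-extensions-avro (older releases: org.apache.beam.sdk.io.AvroIO)
import org.apache.beam.sdk.extensions.avro.io.AvroIO;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DedupedBackupRead {
  // Abbreviated reader schema; use the exact export schema from the
  // subscription-properties documentation linked below.
  private static final String SCHEMA =
      "{\"type\":\"record\",\"name\":\"PubsubMessage\",\"fields\":["
          + "{\"name\":\"message_id\",\"type\":\"string\"},"
          + "{\"name\":\"data\",\"type\":\"bytes\"},"
          + "{\"name\":\"attributes\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Bounded input: the Avro files written by the Cloud Storage subscription.
    PCollection<GenericRecord> records =
        p.apply("ReadBackup",
            AvroIO.readGenericRecords(SCHEMA)
                .from("gs://my-backup-bucket/my-topic/*.avro")); // placeholder path

    // Deduplication stage: the subscription may write a message more than
    // once, so keep one record per Pub/Sub message ID.
    PCollection<GenericRecord> deduped =
        records.apply("DedupByMessageId",
            Distinct.<GenericRecord, String>withRepresentativeValueFn(
                    (GenericRecord r) -> r.get("message_id").toString())
                .withRepresentativeType(TypeDescriptors.strings()));

    // ... downstream batch processing over `deduped` ...
    p.run().waitUntilFinish();
  }
}

Keying Distinct on the message ID is sufficient here because the duplicates are re-deliveries of the same Pub/Sub message, not distinct payloads.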
On Sun, Jan 21, 2024 at 2:45 AM Alex Van Boxel <[email protected]> wrote:

> There are some valid use cases where you want to handle data going over
> Pub/Sub in batch. It's way too expensive to run a simple daily extract
> over Pub/Sub; batch is a lot cheaper.
>
> What we do is back up the data to Cloud Storage; Pub/Sub recently added a
> nice feature that can help you:
>
> - https://cloud.google.com/pubsub/docs/cloudstorage
> - https://cloud.google.com/pubsub/docs/create-cloudstorage-subscription#subscription_properties
>
> This reduced our cost dramatically. We had a Dataflow job doing the backup
> to Cloud Storage, but the above feature is way cheaper. Use the export to
> Avro (the schema is in the second link), and then your batch Beam
> pipeline's input is a bounded input.
>
>  _/
> _/ Alex Van Boxel
>
> On Fri, Jan 19, 2024 at 12:18 AM Reuven Lax via user <[email protected]> wrote:
>
>> Some comments here:
>>
>> 1. "All messages in a Pub/Sub topic" is not a well-defined statement, as
>> there can always be more messages published. You may know that nobody
>> will publish any more messages, but the pipeline does not.
>> 2. While it's possible to read from Pub/Sub in batch, it's usually not
>> recommended. For one thing, I don't think the batch runner can maintain
>> exactly-once processing when reading from Pub/Sub.
>> 3. In Java you can turn an unbounded source (Pub/Sub) into a bounded
>> source that can, in theory, be used for batch jobs. However, this is done
>> by specifying either the max time to read or the max number of messages.
>> I don't think there's any way to automatically read the Pub/Sub topic
>> until there are no more messages in it.
>>
>> Reuven
>>
>> On Thu, Jan 18, 2024 at 2:25 AM Sumit Desai via user <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I want to create a Dataflow pipeline using Pub/Sub as an input
>>> connector, but I want to run it in batch mode and not streaming mode. I
>>> know it's not possible in Python, but how can I achieve this in Java?
>>> Basically, I want my pipeline to read all messages in a Pub/Sub topic,
>>> process them, and terminate. Please suggest.
>>>
>>> Thanks & Regards,
>>> Sumit Desai
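
To tie Alex's Avro suggestion back to the original Java question: once the backup files are read as a bounded PCollection (as in the sketch above), each record can be mapped back into Beam's PubsubMessage type, so the batch pipeline can reuse parsing code written for the streaming one. A hedged sketch; the "data" and "attributes" field names are assumed from the export schema in the second link:

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Converts one backup record back into a PubsubMessage. The "data" and
// "attributes" field names are assumptions; verify them against the
// documented export schema.
class BackupRecordToPubsubMessage extends DoFn<GenericRecord, PubsubMessage> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    GenericRecord r = c.element();
    ByteBuffer data = (ByteBuffer) r.get("data");
    byte[] payload = new byte[data.remaining()];
    data.get(payload);
    // Avro hands map keys/values back as CharSequence (Utf8), so copy them out.
    Map<String, String> attributes = new HashMap<>();
    ((Map<?, ?>) r.get("attributes"))
        .forEach((k, v) -> attributes.put(k.toString(), v.toString()));
    c.output(new PubsubMessage(payload, attributes));
  }
}

// Usage, continuing from the `deduped` collection in the earlier sketch:
//   deduped
//       .apply("ToPubsubMessage", ParDo.of(new BackupRecordToPubsubMessage()))
//       .setCoder(PubsubMessageWithAttributesCoder.of());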
