Thanks Reuven and Alex. Yes, we are considering first specifying the max time
to read on the Pub/Sub input connector. If that doesn't work out for some
reason, we will consider the GCS-based approach. Thanks for your inputs.

Regards,
Sumit Desai



On Mon, Jan 22, 2024 at 4:13 AM Reuven Lax via user <user@beam.apache.org>
wrote:

> Cloud Storage subscriptions are a reasonable way to back up data to
> storage, and you can then run a batch pipeline over the GCS files. Keep in
> mind that these files might contain duplicates (the storage subscriptions
> do not guarantee exactly-once writes). If this is a problem, you should add
> a deduplication stage to the batch job that processes these files.
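>
> For example, a minimal sketch of such a deduplication stage in Java,
> assuming the subscription was configured to write metadata so each Avro
> record carries its Pub/Sub message_id (PubsubRecord and getMessageId()
> are placeholders for however you deserialize the files, and records is
> the PCollection<PubsubRecord> read from GCS):
>
>     import org.apache.beam.sdk.transforms.Distinct;
>     import org.apache.beam.sdk.values.PCollection;
>     import org.apache.beam.sdk.values.TypeDescriptors;
>
>     // Keep exactly one record per Pub/Sub message_id; in batch this is
>     // a global group-by on the representative key.
>     PCollection<PubsubRecord> deduped =
>         records.apply("DedupByMessageId",
>             Distinct.<PubsubRecord, String>withRepresentativeValueFn(
>                     r -> r.getMessageId())
>                 .withRepresentativeType(TypeDescriptors.strings()));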
>
> On Sun, Jan 21, 2024 at 2:45 AM Alex Van Boxel <a...@vanboxel.be> wrote:
>
>> There are some valid use cases where you want to process data that
>> arrives over Pub/Sub in batch. Running a simple daily extract from the
>> data over Pub/Sub with a streaming pipeline is way too expensive; batch
>> is a lot cheaper.
>>
>> What we do is back up the data to Cloud Storage; Pub/Sub recently added
>> a nice feature that can help you:
>>
>>    - https://cloud.google.com/pubsub/docs/cloudstorage
>>    - https://cloud.google.com/pubsub/docs/create-cloudstorage-subscription#subscription_properties
>>
>> This reduced our cost dramatically. We had a Dataflow job doing the backup
>> to Cloud Storage, but the above feature is way cheaper. Use the Avro export
>> (the schema is in the second link), and then your batch Beam pipeline has a
>> bounded input.
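>>
>> As a minimal sketch in Java (the bucket path is made up, and the "data"
>> field name follows the Avro schema documented in the second link above,
>> so verify it against your subscription's settings; in older SDKs AvroIO
>> lives in org.apache.beam.sdk.io instead of the avro extension):
>>
>>     import java.nio.ByteBuffer;
>>     import java.nio.charset.StandardCharsets;
>>     import org.apache.avro.generic.GenericRecord;
>>     import org.apache.beam.sdk.coders.StringUtf8Coder;
>>     import org.apache.beam.sdk.extensions.avro.io.AvroIO;
>>     import org.apache.beam.sdk.values.PCollection;
>>
>>     // Read the Avro backup files as a bounded input and pull the raw
>>     // message payload out of each record.
>>     PCollection<String> payloads =
>>         p.apply("ReadAvroBackup",
>>             AvroIO.parseGenericRecords(
>>                     (GenericRecord rec) -> {
>>                       ByteBuffer buf = (ByteBuffer) rec.get("data");
>>                       byte[] bytes = new byte[buf.remaining()];
>>                       buf.get(bytes);
>>                       return new String(bytes, StandardCharsets.UTF_8);
>>                     })
>>                 .from("gs://my-backup-bucket/pubsub-export/*.avro")
>>                 .withCoder(StringUtf8Coder.of()));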
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Fri, Jan 19, 2024 at 12:18 AM Reuven Lax via user <
>> user@beam.apache.org> wrote:
>>
>>> Some comments here:
>>>    1. "All messages in a Pub/Sub topic" is not a well-defined set, as
>>> there can always be more messages published. You may know that nobody will
>>> publish any more messages, but the pipeline does not.
>>>    2. While it's possible to read from Pub/Sub in batch, it's usually
>>> not recommended. For one thing, I don't think the batch runner can
>>> maintain exactly-once processing when reading from Pub/Sub.
>>>    3. In Java you can turn an unbounded source (Pub/Sub) into a bounded
>>> source that can, in theory, be used for batch jobs. However, this is done
>>> by specifying either the max time to read or the max number of messages. I
>>> don't think there's any way to automatically read the Pub/Sub topic until
>>> there are no more messages in it.
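>>>
>>> A minimal sketch of that mechanism (myUnboundedSource is a placeholder
>>> for an UnboundedSource<PubsubMessage, ?>; recent Java SDKs don't expose
>>> the Pub/Sub source class as stable public API, so check your SDK
>>> version for how to obtain one):
>>>
>>>     import org.apache.beam.sdk.io.Read;
>>>     import org.apache.beam.sdk.values.PCollection;
>>>     import org.joda.time.Duration;
>>>
>>>     // Bound the unbounded read: stop after 10 minutes or 1M records,
>>>     // whichever comes first, yielding a bounded PCollection.
>>>     PCollection<PubsubMessage> bounded =
>>>         p.apply(
>>>             Read.from(myUnboundedSource)
>>>                 .withMaxReadTime(Duration.standardMinutes(10))
>>>                 .withMaxNumRecords(1_000_000));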
>>>
>>> Reuven
>>>
>>> On Thu, Jan 18, 2024 at 2:25 AM Sumit Desai via user <
>>> user@beam.apache.org> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to create a Dataflow pipeline using Pub/Sub as an input
>>>> connector, but I want to run it in batch mode, not streaming mode. I know
>>>> it's not possible in Python, but how can I achieve this in Java? Basically,
>>>> I want my pipeline to read all messages in a Pub/Sub topic, process them,
>>>> and terminate. Please suggest.
>>>>
>>>> Thanks & Regards,
>>>> Sumit Desai
>>>>
>>>
