Cloud Storage subscriptions are a reasonable way to back up data, and you can then run a batch pipeline over the resulting GCS files. Keep in mind that these files might contain duplicates (Cloud Storage subscriptions do not guarantee exactly-once writes). If that's a problem, add a deduplication stage to the batch job that processes these files.
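
For illustration, here's a minimal sketch of such a batch job in Java. It assumes the subscription exports Avro with the write-metadata option enabled (so each record carries a message_id); the bucket path is a placeholder and the reader schema is abbreviated, so take the exact record name and field list from the second link Alex posted below.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
// from beam-sdks-java-extensions-avro (older releases: org.apache.beam.sdk.io.AvroIO)
import org.apache.beam.sdk.extensions.avro.io.AvroIO;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DedupedBackupRead {
  // Abbreviated reader schema; use the exact export schema from the
  // subscription-properties documentation linked below.
  private static final String SCHEMA =
      "{\"type\":\"record\",\"name\":\"PubsubMessage\",\"fields\":["
          + "{\"name\":\"message_id\",\"type\":\"string\"},"
          + "{\"name\":\"data\",\"type\":\"bytes\"},"
          + "{\"name\":\"attributes\",\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Bounded input: the Avro files written by the Cloud Storage subscription.
    PCollection<GenericRecord> records =
        p.apply("ReadBackup",
            AvroIO.readGenericRecords(SCHEMA)
                .from("gs://my-backup-bucket/my-topic/*.avro")); // placeholder path

    // Deduplication stage: the subscription may write a message more than
    // once, so keep one record per Pub/Sub message ID.
    PCollection<GenericRecord> deduped =
        records.apply("DedupByMessageId",
            Distinct.<GenericRecord, String>withRepresentativeValueFn(
                    (GenericRecord r) -> r.get("message_id").toString())
                .withRepresentativeType(TypeDescriptors.strings()));

    // ... downstream batch processing over `deduped` ...
    p.run().waitUntilFinish();
  }
}

Keying Distinct on the message ID is sufficient here because the duplicates are re-deliveries of the same Pub/Sub message, not distinct payloads.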
On Sun, Jan 21, 2024 at 2:45 AM Alex Van Boxel <[email protected]> wrote:

> There are some valid use cases where you want to handle data going over
> Pub/Sub in batch. It's way too expensive to run a simple daily extract
> over Pub/Sub; batch is a lot cheaper.
>
> What we do is back up the data to Cloud Storage; Pub/Sub recently added a
> nice feature that can help you:
>
> - https://cloud.google.com/pubsub/docs/cloudstorage
> - https://cloud.google.com/pubsub/docs/create-cloudstorage-subscription#subscription_properties
>
> This reduced our cost dramatically. We had a Dataflow job doing the backup
> to Cloud Storage, but the above feature is way cheaper. Use the export to
> Avro (the schema is in the second link), and then your batch Beam
> pipeline's input is a bounded input.
>
>  _/
> _/ Alex Van Boxel
>
> On Fri, Jan 19, 2024 at 12:18 AM Reuven Lax via user <[email protected]> wrote:
>
>> Some comments here:
>>
>> 1. "All messages in a Pub/Sub topic" is not a well-defined statement, as
>> there can always be more messages published. You may know that nobody
>> will publish any more messages, but the pipeline does not.
>> 2. While it's possible to read from Pub/Sub in batch, it's usually not
>> recommended. For one thing, I don't think the batch runner can maintain
>> exactly-once processing when reading from Pub/Sub.
>> 3. In Java you can turn an unbounded source (Pub/Sub) into a bounded
>> source that can, in theory, be used for batch jobs. However, this is done
>> by specifying either the max time to read or the max number of messages.
>> I don't think there's any way to automatically read the Pub/Sub topic
>> until there are no more messages in it.
>>
>> Reuven
>>
>> On Thu, Jan 18, 2024 at 2:25 AM Sumit Desai via user <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I want to create a Dataflow pipeline using Pub/Sub as an input
>>> connector, but I want to run it in batch mode and not streaming mode. I
>>> know it's not possible in Python, but how can I achieve this in Java?
>>> Basically, I want my pipeline to read all messages in a Pub/Sub topic,
>>> process them, and terminate. Please suggest.
>>>
>>> Thanks & Regards,
>>> Sumit Desai
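
To tie Alex's Avro suggestion back to the original Java question: once the backup files are read as a bounded PCollection (as in the sketch above), each record can be mapped back into Beam's PubsubMessage type, so the batch pipeline can reuse parsing code written for the streaming one. A hedged sketch; the "data" and "attributes" field names are assumed from the export schema in the second link:

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Converts one backup record back into a PubsubMessage. The "data" and
// "attributes" field names are assumptions; verify them against the
// documented export schema.
class BackupRecordToPubsubMessage extends DoFn<GenericRecord, PubsubMessage> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    GenericRecord r = c.element();
    ByteBuffer data = (ByteBuffer) r.get("data");
    byte[] payload = new byte[data.remaining()];
    data.get(payload);
    // Avro hands map keys/values back as CharSequence (Utf8), so copy them out.
    Map<String, String> attributes = new HashMap<>();
    ((Map<?, ?>) r.get("attributes"))
        .forEach((k, v) -> attributes.put(k.toString(), v.toString()));
    c.output(new PubsubMessage(payload, attributes));
  }
}

// Usage, continuing from the `deduped` collection in the earlier sketch:
//   deduped
//       .apply("ToPubsubMessage", ParDo.of(new BackupRecordToPubsubMessage()))
//       .setCoder(PubsubMessageWithAttributesCoder.of());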
