Does withkeys transform enforce a reshuffle?

2024-01-18 Thread hsy...@gmail.com
Hey guys,

I have a question, does withkeys transformation enforce a reshuffle?

My pipeline basically look like this PubsubLiteIO -> ParDo(..) -> ParDo()
-> BigqueryIO.write()

The problem is PubsubLiteIO -> ParDo(..) -> ParDo() always fused together.
But The ParDo is expensive and I want dataflow to have more workers to work
on that, what's the best way to do that?

Regards,


Re: Using Dataflow with Pubsub input connector in batch mode

2024-01-18 Thread Reuven Lax via user
Some comments here:
   1. All messages in a PubSub topic is not a well-defined statement, as
there can always be more messages published. You may know that nobody will
publish any more messages, but the pipeline does not.
   2. While it's possible to read from Pub/Sub in batch, it's usually not
recommended. For one thing I don't think that the batch runner can maintain
exactly-once processing when reading from Pub/Sub.
   3. In Java you can turn an unbounded source (Pub/Sub) into a bounded
source that can in theory be used for batch jobs. However this is done by
specifying either the max time to read or the max number of messages. I
don't think there's any way to automatically read the Pub/Sub topic until
there are no more messages in it.

Reuven

On Thu, Jan 18, 2024 at 2:25 AM Sumit Desai via user 
wrote:

> Hi all,
>
> I want to create a Dataflow pipeline using Pub/sub as an input connector
> but I want to run it in batch mode and not streaming mode. I know it's not
> possible in Python but how can I achieve this in Java? Basically, I want my
> pipeline to read all messages in a Pubsub topic, process and terminate.
> Please suggest.
>
> Thanks & Regards,
> Sumit Desai
>


Using Dataflow with Pubsub input connector in batch mode

2024-01-18 Thread Sumit Desai via user
Hi all,

I want to create a Dataflow pipeline using Pub/sub as an input connector
but I want to run it in batch mode and not streaming mode. I know it's not
possible in Python but how can I achieve this in Java? Basically, I want my
pipeline to read all messages in a Pubsub topic, process and terminate.
Please suggest.

Thanks & Regards,
Sumit Desai