Re: Consuming one PCollection before consuming another with Beam

2023-03-01 Thread Reuven Lax via dev
I'm not sure I understand this use case well. What are you planning on doing with the BQ dataset if it were processed first? Were you planning on caching information in memory? Storing data in Beam state? Something else? On Wed, Mar 1, 2023 at 10:43 AM Kenneth Knowles wrote: > > > On Tue, Feb

Re: Consuming one PCollection before consuming another with Beam

2023-03-01 Thread Kenneth Knowles
On Tue, Feb 28, 2023 at 5:14 PM Sahil Modak wrote: > The number of keys/data in BQ would not be constant and grow with time. > > A rough estimate would be around 300k keys with an average size of 5kb per > key. Both the count of the keys and the size of the key would be feature > dependent

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Niel Markwick via dev
Regarding ordering: anything that requires inputs to be in a specific order in Beam will be problematic due to the nature of parallel processing - you will always get race conditions. Assuming you are still intending to Flatten the BigQuery and PubSub PCollections, using Wait.on() before flattening

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Sahil Modak via dev
The number of keys/data in BQ would not be constant and would grow with time. A rough estimate would be around 300k keys with an average size of 5 KB per key. Both the count of the keys and the size of the key would be feature dependent (based on the upstream pipelines) and we won't have control over

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Kenneth Knowles
I'm also curious how much you depend on order to get the state contents right. The ordering of the side input will be arbitrary, and even the streaming input can have plenty of out-of-order messages. So I want to think about which data dependencies result in the requirement of order.

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Jan Lukavský
In the case that the data is too large for a side input, you could do the same by reassigning timestamps of the BQ input to BoundedWindow.TIMESTAMP_MIN_VALUE (you would have to do that in a stateful DoFn with a timer having outputTimestamp set to TIMESTAMP_MIN_VALUE to hold the watermark, or using

Re: Consuming one PCollection before consuming another with Beam

2023-02-27 Thread Reuven Lax via dev
How large is this state spec stored in BQ? If the size isn't too large, you can read it from BQ and make it a side input into the DoFn. On Mon, Feb 27, 2023 at 11:06 AM Sahil Modak wrote: > We are trying to re-initialize our state specs in the BusinessLogic() DoFn > from BQ. > BQ has data about

Re: Consuming one PCollection before consuming another with Beam

2023-02-27 Thread Niel Markwick via dev
Why not pass the BQ data as a side input to your transform? Side inputs are read fully and materialised before the transform starts. This will allow your transform to initialize its state before processing any elements from the PubSub input. On Mon, 27 Feb 2023, 20:43 Daniel Collins via dev,

Re: Consuming one PCollection before consuming another with Beam

2023-02-27 Thread Daniel Collins via dev
It sounds like what you're doing here might be best done outside the beam model. Instead of performing the initial computation reading from BQ into a PCollection, perform it using the BigQuery client library in the same manner as you currently do to load the data from redis. On Mon, Feb 27, 2023
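
As a rough illustration of the "load it outside the Beam model" idea above, here is a minimal sketch in plain Java with no Beam or BigQuery dependencies. The class name BootstrapCache and the loader supplier are hypothetical and not from the thread; the supplier stands in for whatever BigQuery client call would fetch the key-to-state snapshot. The sketch only shows the load-once-then-serve pattern.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical helper: loads a key -> state snapshot at most once per JVM
 * and serves lookups from the cached copy afterwards.
 */
final class BootstrapCache {
    // Assumption: in a real pipeline this would wrap a BigQuery client query.
    private final Supplier<Map<String, String>> loader;

    // Null until the first successful load; volatile for safe publication.
    private volatile Map<String, String> snapshot;

    BootstrapCache(Supplier<Map<String, String>> loader) {
        this.loader = loader;
    }

    /** Returns the cached snapshot, loading it on first use (thread-safe). */
    Map<String, String> get() {
        Map<String, String> local = snapshot;
        if (local == null) {
            synchronized (this) {
                local = snapshot;
                if (local == null) {
                    // Defensive copy so later changes by the loader are not visible.
                    local = Collections.unmodifiableMap(new HashMap<>(loader.get()));
                    snapshot = local;
                }
            }
        }
        return local;
    }

    /** Initial state for a key, or the fallback if nothing is stored for it. */
    String initialStateFor(String key, String fallback) {
        String stored = get().get(key);
        return stored != null ? stored : fallback;
    }
}
```

In a Beam pipeline, one plausible way to use something like this would be to keep a single instance in a static field and call get() from the DoFn's @Setup method, so that each worker finishes loading before it processes any Pub/Sub element. Whether that fits depends on how large the snapshot is and how fresh it needs to be.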

Re: Consuming one PCollection before consuming another with Beam

2023-02-27 Thread Sahil Modak via dev
We are trying to re-initialize our state specs in the BusinessLogic() DoFn from BQ. BQ has data about the state spec, and we would like to make sure that the state specs in our BusinessLogic() DoFn are initialized before it starts consuming the pub/sub. This is for handling the case of

Re: Consuming one PCollection before consuming another with Beam

2023-02-24 Thread Kenneth Knowles
My suggestion is to try to solve the problem in terms of what you want to compute. Instead of trying to control the operational aspects like "read all the BQ before reading Pubsub" there is presumably some reason that the BQ data naturally "comes first", for example if its timestamps are earlier

Re: Consuming one PCollection before consuming another with Beam

2023-02-24 Thread Reuven Lax via dev
First, PCollections are completely unordered, so there is no guarantee on what order you'll see events in the flattened PCollection. There may be ways to process the BigQuery data in a separate transform first, but it depends on the structure of the data. How large is the BigQuery table? Are you

Re: Consuming one PCollection before consuming another with Beam

2023-02-24 Thread Sahil Modak via dev
Yes, this is a streaming pipeline. Some more details about the existing implementation vs. what we want to achieve. Current implementation: Reading from pub-sub: Pipeline input = Pipeline.create(options); PCollection pubsubStream = input.apply("Read From Pubsub",

Re: Consuming one PCollection before consuming another with Beam

2023-02-23 Thread Reuven Lax via dev
Can you explain this use case some more? Is this a streaming pipeline? If so, how are you reading from BigQuery? On Thu, Feb 23, 2023 at 10:06 PM Sahil Modak via dev wrote: > Hi, > > We have a requirement wherein we are consuming input from pub/sub > (PubSubIO) as well as BQ (BQIO) > > We want

Consuming one PCollection before consuming another with Beam

2023-02-23 Thread Sahil Modak via dev
Hi, We have a requirement wherein we are consuming input from pub/sub (PubSubIO) as well as BQ (BQIO). We want to make sure that we consume the BQ stream first before we start consuming the data from pub-sub. Is there a way to achieve this? Can you please help with some code samples? Currently,