I'm not sure I understand this use case well. What are you planning on
doing with the BQ dataset if it were processed first? Were you planning on
caching information in memory? Storing data in Beam state? Something else?
On Wed, Mar 1, 2023 at 10:43 AM Kenneth Knowles wrote:
On Tue, Feb 28, 2023 at 5:14 PM Sahil Modak wrote:
> The number of keys/data in BQ would not be constant and would grow with time.
>
> A rough estimate would be around 300k keys with an average size of 5kb per
> key. Both the count of the keys and the size of each key would be feature
> dependent
Regarding ordering: anything that requires inputs to be in a specific
order in Beam will be problematic due to the nature of parallel processing;
you will always get race conditions.
Assuming you are still intending to Flatten the BigQuery and Pub/Sub
PCollections, using Wait.on before flattening should hold back the Pub/Sub
elements until the BigQuery read has completed.
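For concreteness, a minimal sketch of the Wait.on shape (a hedged illustration, not code from this thread; `tableSpec` and `subscription` are placeholder names):

```java
// Read the bounded BQ data and the unbounded Pub/Sub stream.
PCollection<TableRow> bqRows =
    pipeline.apply("Read BQ", BigQueryIO.readTableRows().from(tableSpec));

PCollection<PubsubMessage> pubsubStream =
    pipeline.apply("Read Pubsub",
        PubsubIO.readMessagesWithAttributes().fromSubscription(subscription));

// Wait.on holds back the Pub/Sub elements until bqRows has been fully
// produced, so anything downstream of the flatten sees the BQ-derived
// work finish first.
PCollection<PubsubMessage> gated =
    pubsubStream.apply("Wait for BQ", Wait.on(bqRows));
```

Note that Wait.on only gates on the signal collection completing (per window); it does not impose any element-level ordering within the flattened stream.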
The number of keys/data in BQ would not be constant and would grow with time.
A rough estimate would be around 300k keys with an average size of 5kb per
key. Both the count of the keys and the size of each key would be feature
dependent (based on the upstream pipelines) and we won't have control over it.
I'm also curious how much you depend on order to get the state contents
right. The ordering of the side input will be arbitrary, and even the
streaming input can have plenty of out-of-order messages. So I want to
think about what data dependencies result in the requirement of ordering.
In the case that the data is too large for a side input, you could do the
same by reassigning the timestamps of the BQ input to
BoundedWindow.TIMESTAMP_MIN_VALUE (you would have to do that in a
stateful DoFn with a timer having outputTimestamp set to
TIMESTAMP_MIN_VALUE to hold the watermark, or using
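Since the snippet above is cut off in the archive, here is one hedged sketch of the timestamp-reassignment part. It uses the deprecated withAllowedTimestampSkew rather than the stateful-DoFn-with-timer approach the message describes, and `bqRows` is a placeholder name:

```java
// Move every BQ element to the minimum event timestamp so it comes
// "before" all Pub/Sub data in event time. withAllowedTimestampSkew is
// deprecated, but it is the simplest way to shift timestamps backwards;
// the stateful-DoFn-plus-timer approach avoids the deprecated API.
PCollection<TableRow> bqAtMinTimestamp =
    bqRows.apply("Rewind timestamps",
        WithTimestamps.<TableRow>of(row -> BoundedWindow.TIMESTAMP_MIN_VALUE)
            .withAllowedTimestampSkew(Duration.millis(Long.MAX_VALUE)));
```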
How large is this state spec stored in BQ? If the size isn't too large, you
can read it from BQ and make it a side input into the DoFn.
On Mon, Feb 27, 2023 at 11:06 AM Sahil Modak
wrote:
> We are trying to re-initialize our state specs in the BusinessLogic() DoFn
> from BQ.
> BQ has data about
Why not pass the BQ data as a side input to your transform?
Side inputs are read fully and materialised before the transform starts.
This will allow your transform to initialize its state before processing
any elements from the PubSub input.
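A hedged sketch of that side-input shape (the column names "key"/"value", `tableSpec`, and `pubsubStream` are placeholders, not details from this thread):

```java
// Materialize the BQ state data as a map-valued side input.
PCollectionView<Map<String, String>> stateView =
    pipeline
        .apply("Read state from BQ", BigQueryIO.readTableRows().from(tableSpec))
        .apply("To KV",
            MapElements.into(TypeDescriptors.kvs(
                    TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via(row -> KV.of(
                    (String) row.get("key"), (String) row.get("value"))))
        .apply(View.asMap());

// The side input is fully read before the first main-input element
// in its window is processed.
pubsubStream.apply("BusinessLogic",
    ParDo.of(new DoFn<PubsubMessage, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        Map<String, String> initialState = c.sideInput(stateView);
        // ... initialize state from initialState, then handle c.element() ...
      }
    }).withSideInputs(stateView));
```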
On Mon, 27 Feb 2023, 20:43 Daniel Collins via dev wrote:
It sounds like what you're doing here might be best done outside the Beam
model. Instead of performing the initial computation by reading from BQ into a
PCollection, perform it using the BigQuery client library, in the same
manner as you currently do to load the data from Redis.
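A hedged sketch of that suggestion, loading the data with the google-cloud-bigquery client library in the DoFn's @Setup (the query, table, and field names are placeholders):

```java
class BusinessLogicFn extends DoFn<PubsubMessage, String> {

  private transient Map<String, String> initialState;

  @Setup
  public void setup() throws InterruptedException {
    // Runs once per DoFn instance, before any Pub/Sub element is processed.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    initialState = new HashMap<>();
    for (FieldValueList row :
        bigquery
            .query(QueryJobConfiguration.of(
                "SELECT key, value FROM my_dataset.state_table"))
            .iterateAll()) {
      initialState.put(
          row.get("key").getStringValue(), row.get("value").getStringValue());
    }
  }

  @ProcessElement
  public void process(ProcessContext c) {
    // initialState is guaranteed to be populated here.
  }
}
```

One trade-off to be aware of: each worker re-reads the table in setup, and elements see a snapshot taken at worker startup rather than live BQ state.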
On Mon, Feb 27, 2023
We are trying to re-initialize our state specs in the BusinessLogic() DoFn
from BQ.
BQ has data about the state spec, and we would like to make sure that the
state specs in our BusinessLogic() DoFn are initialized before it starts
consuming the Pub/Sub stream.
This is for handling the case of
My suggestion is to try to solve the problem in terms of what you want to
compute. Instead of trying to control the operational aspects like "read
all the BQ before reading Pubsub", there is presumably some reason that the
BQ data naturally "comes first", for example if its timestamps are earlier.
First, PCollections are completely unordered, so there is no guarantee on
what order you'll see events in the flattened PCollection.
There may be ways to process the BigQuery data in a separate transform
first, but it depends on the structure of the data. How large is the
BigQuery table? Are you
Yes, this is a streaming pipeline.
Some more details about the existing implementation vs. what we want to achieve.
Current implementation:
Reading from Pub/Sub:
Pipeline input = Pipeline.create(options);
PCollection<PubsubMessage> pubsubStream = input.apply("Read From Pubsub",
    PubsubIO.readMessagesWithAttributes().fromSubscription(subscription));
Can you explain this use case some more? Is this a streaming pipeline? If
so, how are you reading from BigQuery?
On Thu, Feb 23, 2023 at 10:06 PM Sahil Modak via dev
wrote:
Hi,
We have a requirement wherein we are consuming input from Pub/Sub
(PubSubIO) as well as BQ (BQIO).
We want to make sure that we consume the BQ stream first before we start
consuming the data from pub-sub. Is there a way to achieve this? Can you
please help with some code samples?
Currently,