That one looks interesting. What is not clear to me is what the advantages of using it are. Is it only the error/retry handling? Is there anything in terms of performance?
My PCollection is unbounded, but I was thinking of sending my messages in
batches to the external API in order to gain some performance (I don't
expect to send 1 HTTP request per message).

Thank you very much for all your responses!

On Sun, Apr 14, 2024 at 8:28 AM XQ Hu via user <[email protected]> wrote:
>
> To enrich your data, have you checked
> https://cloud.google.com/dataflow/docs/guides/enrichment?
>
> This transform is built on top of
> https://beam.apache.org/documentation/io/built-in/webapis/
>
> On Fri, Apr 12, 2024 at 4:38 PM Ruben Vargas <[email protected]> wrote:
>>
>> On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim <[email protected]> wrote:
>> >
>> > Here is an example from a book that I'm reading now and it may be
>> > applicable.
>> >
>> > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
>> > PYTHON - ord(id[0]) % 100
>>
>> Maybe this is what I'm looking for. I'll give it a try. Thanks!
>>
>> >
>> > On Sat, 13 Apr 2024 at 06:12, George Dekermenjian <[email protected]>
>> > wrote:
>> >>
>> >> How about just keeping track of a buffer and flush the buffer after 100
>> >> messages and if there is a buffer on finish_bundle as well?
>> >>
>> >>
>>
>> If this is in memory, It could lead to potential loss of data. That is
>> why the state is used or at least that is my understanding. but maybe
>> there is a way to do this in the state?
>>
>>
>> >> On Fri, Apr 12, 2024 at 21.23 Ruben Vargas <[email protected]>
>> >> wrote:
>> >>>
>> >>> Hello guys
>> >>>
>> >>> Maybe this question was already answered, but I cannot find it and
>> >>> want some more input on this topic.
>> >>>
>> >>> I have some messages that don't have any particular key candidate,
>> >>> except the ID, but I don't want to use it because the idea is to
>> >>> group multiple IDs in the same batch.
>> >>>
>> >>> This is my use case:
>> >>>
>> >>> I have an endpoint where I'm gonna send the message ID, this endpoint
>> >>> is gonna return me certain information which I will use to enrich my
>> >>> message. In order to avoid fetching the endpoint per message I want to
>> >>> batch it in 100 and send the 100 IDs in one request ( the endpoint
>> >>> supports it) . I was thinking on using GroupIntoBatches.
>> >>>
>> >>> - If I choose the ID as the key, my understanding is that it won't
>> >>> work in the way I want (because it will form batches of the same ID).
>> >>> - Use a constant will be a problem for parallelism, is that correct?
>> >>>
>> >>> Then my question is, what should I use as a key? Maybe something
>> >>> regarding the timestamp? so I can have groups of messages that arrive
>> >>> at a certain second?
>> >>>
>> >>> Any suggestions would be appreciated
>> >>>
>> >>> Thanks.
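
For reference, the key choice discussed in this thread (derive a small shard number from the ID, so that different IDs land in the same batch while work still spreads over several keys) can be sketched in plain Python. This is only a minimal sketch under assumptions: the names shard_key, batch_by_shard, NUM_SHARDS and BATCH_SIZE are hypothetical, and zlib.crc32 is swapped in for the ord(id[0]) example from the thread because it uses the whole ID and, unlike Python's built-in hash() on strings, gives the same value in every process. The batching function is a local simulation of keyed batching, not Beam code.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 100   # hypothetical: number of distinct keys (upper bound on parallelism)
BATCH_SIZE = 100   # batch size mentioned in the thread


def shard_key(message_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an ID to a stable shard number in [0, num_shards)."""
    # crc32 is deterministic across processes and machines.
    return zlib.crc32(message_id.encode("utf-8")) % num_shards


def batch_by_shard(message_ids, num_shards=NUM_SHARDS, batch_size=BATCH_SIZE):
    """Local simulation of keyed batching.

    Buffers IDs per shard key, emits a (key, batch) pair whenever a buffer
    reaches batch_size, and flushes the partially filled buffers at the end.
    """
    buffers = defaultdict(list)
    batches = []
    for message_id in message_ids:
        key = shard_key(message_id, num_shards)
        buffers[key].append(message_id)
        if len(buffers[key]) == batch_size:
            batches.append((key, buffers.pop(key)))
    # Flush whatever is left, in key order for reproducibility.
    for key, leftover in sorted(buffers.items()):
        batches.append((key, leftover))
    return batches


if __name__ == "__main__":
    ids = [f"msg-{i}" for i in range(1000)]
    for key, batch in batch_by_shard(ids, num_shards=4)[:3]:
        print(key, len(batch), batch[:2])
```

In a pipeline, the same shard_key function would presumably be used to attach the key (for example with beam.WithKeys) ahead of beam.GroupIntoBatches(100). On an unbounded PCollection a buffering limit such as max_buffering_duration_secs is probably needed as well, so that partially filled batches are still emitted. The number of shards is a trade-off: fewer shards fill batches faster, more shards allow more parallelism.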
