Re: Any recommendation for key for GroupIntoBatches

Ruben Vargas Mon, 15 Apr 2024 07:38:26 -0700

That one looks interesting

What is not clear to me is what the advantages of using it are. Is it
only the error/retry handling? Is there anything in terms of performance?


My PCollection is unbounded, but I was thinking of sending my messages
in batches to the external API to gain some performance
(I'd rather not send one HTTP request per message).
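
For reference, here is a minimal sketch of the keying idea from Jaehyeon's
reply below, written in plain Python so it runs without a pipeline. It is
not from the book; the names shard_key, batch_by_shard, NUM_SHARDS and
BATCH_SIZE are made up for illustration. I swapped ord(id[0]) for
zlib.crc32 on the assumption that IDs may share their first characters
(which would put everything on a handful of keys), and because Python's
built-in hash() for strings is salted per process, so it is not stable
across workers.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 100  # assumed number of distinct keys; tune for parallelism
BATCH_SIZE = 100  # batch size mentioned in the thread


def shard_key(message_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a message ID to a stable key in [0, num_shards)."""
    # crc32 gives the same value in every process, unlike hash() on str.
    return zlib.crc32(message_id.encode("utf-8")) % num_shards


def batch_by_shard(message_ids, batch_size=BATCH_SIZE, num_shards=NUM_SHARDS):
    """Local stand-in for keying followed by per-key batching.

    Yields (key, batch) pairs; a batch mixes many different IDs.
    """
    buffers = defaultdict(list)
    for message_id in message_ids:
        key = shard_key(message_id, num_shards)
        buffers[key].append(message_id)
        if len(buffers[key]) == batch_size:
            # Full batch for this key: emit it and start a new one.
            yield key, buffers.pop(key)
    # Emit whatever is left as smaller, partial batches.
    for key, batch in buffers.items():
        if batch:
            yield key, batch


if __name__ == "__main__":
    ids = [f"msg-{i}" for i in range(1000)]
    for key, batch in batch_by_shard(ids, batch_size=10, num_shards=4):
        print(key, len(batch))
```

In an actual Beam Python pipeline I would expect shard_key to be applied
in a beam.Map that emits (shard_key(id), message) pairs, followed by
beam.GroupIntoBatches(100). For an unbounded PCollection, the
max_buffering_duration_secs argument of GroupIntoBatches looks like the way
to keep partial batches from waiting indefinitely, if I am reading the
Python SDK correctly.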

Thank you very much for all your responses!


On Sun, Apr 14, 2024 at 8:28 AM XQ Hu via user <[email protected]> wrote:
>
> To enrich your data, have you checked 
> https://cloud.google.com/dataflow/docs/guides/enrichment?
>
> This transform is built on top of 
> https://beam.apache.org/documentation/io/built-in/webapis/
>
> On Fri, Apr 12, 2024 at 4:38 PM Ruben Vargas <[email protected]> wrote:
>>
>> On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim <[email protected]> wrote:
>> >
>> > Here is an example from a book that I'm reading now and it may be 
>> > applicable.
>> >
>> > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
>> > PYTHON - ord(id[0]) % 100
>>
>> Maybe this is what I'm looking for. I'll give it a try. Thanks!
>>
>> >
>> > On Sat, 13 Apr 2024 at 06:12, George Dekermenjian <[email protected]> 
>> > wrote:
>> >>
>> >> How about just keeping a buffer and flushing it after 100 messages,
>> >> and also in finish_bundle if anything is still left in the buffer?
>> >>
>> >>
>>
>> If this is in memory, it could lead to potential loss of data. That is
>> why the state is used, or at least that is my understanding. But maybe
>> there is a way to do this in the state?
>>
>>
>> >> On Fri, Apr 12, 2024 at 21.23 Ruben Vargas <[email protected]> 
>> >> wrote:
>> >>>
>> >>> Hello guys
>> >>>
>> >>> Maybe this question was already answered, but I cannot find it and
>> >>> want some more input on this topic.
>> >>>
>> >>> I have some messages that don't have any particular key candidate,
>> >>> except the ID, but I don't want to use it because the idea is to
>> >>> group multiple IDs in the same batch.
>> >>>
>> >>> This is my use case:
>> >>>
>> >>> I have an endpoint to which I'm going to send the message ID; the
>> >>> endpoint returns certain information that I will use to enrich my
>> >>> message. To avoid calling the endpoint once per message, I want to
>> >>> batch the IDs in groups of 100 and send all 100 in one request (the
>> >>> endpoint supports it). I was thinking of using GroupIntoBatches.
>> >>>
>> >>> - If I choose the ID as the key, my understanding is that it won't
>> >>> work the way I want (because it will form batches of the same ID).
>> >>> - Using a constant will be a problem for parallelism, is that correct?
>> >>>
>> >>> Then my question is, what should I use as a key? Maybe something
>> >>> based on the timestamp, so I can have groups of messages that arrive
>> >>> within a certain second?
>> >>>
>> >>> Any suggestions would be appreciated
>> >>>
>> >>> Thanks.
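
The buffer-and-flush idea raised in this thread (flush after 100 messages,
and again in finish_bundle) can be sketched roughly as below. This is a
plain-Python stand-in that only borrows the start_bundle / process /
finish_bundle method names from Beam's DoFn lifecycle; it is not a real
DoFn, and BufferingBatcher and send_batch are hypothetical names, with
send_batch standing for one request to the external endpoint.

```python
from typing import Callable, List


class BufferingBatcher:
    """Buffers IDs in memory and sends them in batches of batch_size."""

    def __init__(self, send_batch: Callable[[List[str]], None],
                 batch_size: int = 100):
        self._send_batch = send_batch  # hypothetical: one API call per batch
        self._batch_size = batch_size
        self._buffer: List[str] = []

    def start_bundle(self) -> None:
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, message_id: str) -> None:
        self._buffer.append(message_id)
        if len(self._buffer) >= self._batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Send the leftover partial batch so nothing stays behind in memory.
        if self._buffer:
            self._flush()

    def _flush(self) -> None:
        batch, self._buffer = self._buffer, []
        self._send_batch(batch)


if __name__ == "__main__":
    sent: List[List[str]] = []
    batcher = BufferingBatcher(sent.append, batch_size=3)
    batcher.start_bundle()
    for i in range(7):
        batcher.process(f"id-{i}")
    batcher.finish_bundle()
    print([len(batch) for batch in sent])
```

One thing to keep in mind with this approach: a batch never spans bundles,
so if the runner produces small bundles in streaming mode, most batches
would likely end up well below 100.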
