Re: Any recommendation for key for GroupIntoBatches

2024-04-28 Thread Wiśniowski Piotr
Hi, I might be late to the discussion, but here is another option (I think it was not mentioned, or I missed it). Take a look at [this](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements) as I think this is precisely

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Robert Bradshaw via user
On Fri, Apr 12, 2024 at 1:39 PM Ruben Vargas wrote:
> On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote:
> >
> > Here is an example from a book that I'm reading now and it may be applicable.
> >
> > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
> > PYTHON - ord(id[0]) % 100
> or

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Reuven Lax via user
There are various strategies. Here is an example of how Beam does it (taken from Reshuffle.viaRandomKey().withNumBuckets(N)). Note that this does some extra hashing to work around issues with the Spark runner. If you don't care about that, you could implement something simpler (e.g. initialize
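A simpler version of that idea, sketched in plain Python (my own illustration of the random-key technique, not the Beam internals): assign each element a random bucket number as its key, which spreads work evenly across N keys regardless of the element contents. The bucket count is a hypothetical tuning knob:

```python
import random

NUM_BUCKETS = 100  # hypothetical; tune to your desired parallelism

def assign_random_key(element, num_buckets=NUM_BUCKETS):
    """Pair an element with a random bucket key, as a beam.Map would
    do before GroupIntoBatches. Randomness spreads load on average."""
    return (random.randrange(num_buckets), element)

# Simulate keying a batch of elements.
keyed = [assign_random_key(i) for i in range(1000)]
```

In a pipeline this would run as a `beam.Map` ahead of `GroupIntoBatches`; `Reshuffle.viaRandomKey().withNumBuckets(N)` additionally rehashes the key, which per Reuven's note is only needed to work around the Spark runner issue.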

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Damon Douglas
Good day, Ruben,

Would you be able to compute a shasum on the group of IDs to use as the key?

Best, Damon

On 2024/04/12 19:22:45 Ruben Vargas wrote:
> Hello guys
>
> Maybe this question was already answered, but I cannot find it and
> want some more input on this topic.
>
> I have some
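A sketch of that suggestion (function names and the shard count are my own assumptions): hash the ID with SHA-256 and reduce the digest modulo a shard count. Unlike Python's built-in `hash()`, which is salted per process, a digest is stable across workers and runs, so it is safe to use as a distributed key:

```python
import hashlib

NUM_SHARDS = 100  # hypothetical shard count

def sha_key(message_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministic, well-distributed key derived from a SHA-256
    digest of the ID (or of a concatenation of a group of IDs)."""
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```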

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
Yeah, unfortunately the data on the endpoint could change at any point in time and I need to make sure to have the latest one :/ That limits my options here. But I also have other sources that can benefit from this caching :) Thank you very much!

On Mon, Apr 15, 2024 at 9:37 AM XQ Hu wrote: >

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread XQ Hu via user
I am not sure you still need to do batching, since the Web API can handle caching. If you really need it, I think GroupIntoBatches is a good way to go.

On Mon, Apr 15, 2024 at 11:30 AM Ruben Vargas wrote:
> Is there a way to do batching in that transformation? I'm assuming for
> now no. or may be

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
Is there a way to do batching in that transformation? I'm assuming for now no, or maybe by using it in conjunction with GroupIntoBatches.

On Mon, Apr 15, 2024 at 9:29 AM Ruben Vargas wrote:
>
> Interesting
>
> I think the cache feature could be interesting for some use cases I have.
>
> On Mon, Apr 15,

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
Interesting. I think the cache feature could be interesting for some use cases I have.

On Mon, Apr 15, 2024 at 9:18 AM XQ Hu wrote:
>
> For the new web API IO, the page lists these features:
>
> developers provide minimal code that invokes Web API endpoint
> delegate to the transform to handle

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread XQ Hu via user
For the new web API IO, the page lists these features:
- developers provide minimal code that invokes the Web API endpoint
- delegate to the transform to handle request retries and exponential backoff
- optional caching of request and response associations
- optional metrics

On Mon,

Re: Any recommendation for key for GroupIntoBatches

2024-04-15 Thread Ruben Vargas
That one looks interesting. What is not clear to me is: what are the advantages of using it? Is it only the error/retry handling, or is there anything in terms of performance? My PCollection is unbounded, but I was thinking of sending my messages in batches to the external API in order to gain some performance

Re: Any recommendation for key for GroupIntoBatches

2024-04-14 Thread XQ Hu via user
To enrich your data, have you checked https://cloud.google.com/dataflow/docs/guides/enrichment? This transform is built on top of https://beam.apache.org/documentation/io/built-in/webapis/

On Fri, Apr 12, 2024 at 4:38 PM Ruben Vargas wrote:
> On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim

Re: Any recommendation for key for GroupIntoBatches

2024-04-12 Thread Ruben Vargas
On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote:
>
> Here is an example from a book that I'm reading now and it may be applicable.
>
> JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
> PYTHON - ord(id[0]) % 100

Maybe this is what I'm looking for. I'll give it a try. Thanks!

On Sat, 13

Re: Any recommendation for key for GroupIntoBatches

2024-04-12 Thread Jaehyeon Kim
Here is an example from a book that I'm reading now and it may be applicable.

JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
PYTHON - ord(id[0]) % 100

On Sat, 13 Apr 2024 at 06:12, George Dekermenjian wrote:
> How about just keeping track of a buffer and flush the buffer after 100
> messages
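The Python one-liner, expanded into runnable form with a caveat I'm adding (not from the book): `ord(id[0]) % 100` only looks at the first character, so IDs drawn from a narrow alphabet (e.g. hex UUIDs) yield at most as many distinct keys as there are distinct first characters. A digest over the whole ID spreads keys more evenly:

```python
import hashlib

# The book's Python one-liner: key from the first character only.
def first_char_key(message_id: str) -> int:
    return ord(message_id[0]) % 100

# Caveat (my note): hex IDs have only 16 possible first characters,
# so first_char_key yields at most 16 distinct keys out of 100.
# A digest over the whole ID is stable and much better distributed.
def digest_key(message_id: str) -> int:
    return int(hashlib.md5(message_id.encode()).hexdigest(), 16) % 100
```

The Java variant `(id.hashCode() & Integer.MAX_VALUE) % 100` already mixes the whole string, so it does not have this skew; the `& Integer.MAX_VALUE` just clears the sign bit before the modulo.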

Re: Any recommendation for key for GroupIntoBatches

2024-04-12 Thread George Dekermenjian
How about just keeping track of a buffer and flushing it after 100 messages, also flushing whatever remains in the buffer in finish_bundle?

On Fri, Apr 12, 2024 at 21:23 Ruben Vargas wrote:
> Hello guys
>
> Maybe this question was already answered, but I cannot find it and
> want some more input on
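That buffering pattern, sketched as a plain Python class (my own naming; in Beam this logic would live in a DoFn's `process` and `finish_bundle` methods):

```python
class BufferedSender:
    """Buffers items and flushes every `batch_size` items; close() flushes
    any remainder, playing the role of a DoFn's finish_bundle."""

    def __init__(self, send, batch_size=100):
        self.send = send          # callable taking a list, e.g. an API call
        self.batch_size = batch_size
        self.buffer = []

    def process(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()

    def close(self):
        # finish_bundle analogue: flush the partial batch.
        self.flush()
```

One caveat relevant to the wider thread: streaming runners may use small bundles, so per-bundle buffering often flushes well before reaching 100 items; stateful `GroupIntoBatches` batches across bundles instead.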

Any recommendation for key for GroupIntoBatches

2024-04-12 Thread Ruben Vargas
Hello guys

Maybe this question was already answered, but I cannot find it and want some more input on this topic.

I have some messages that don't have any particular key candidate, except the ID, but I don't want to use it because the idea is to group multiple IDs in the same batch. This is