Re: Any recommendation for key for GroupIntoBatches
On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote:
> Here is an example from a book that I'm reading now and it may be applicable.
>
> JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
> PYTHON - ord(id[0]) % 100

Maybe this is what I'm looking for. I'll give it a try. Thanks!

> On Sat, 13 Apr 2024 at 06:12, George Dekermenjian wrote:
>> How about just keeping track of a buffer and flush the buffer after 100
>> messages and if there is a buffer on finish_bundle as well?

If the buffer is only in memory, it could lead to loss of data. That is why state is used, or at least that is my understanding. But maybe there is a way to do this with state?
Re: Any recommendation for key for GroupIntoBatches
Here is an example from a book that I'm reading now, and it may be applicable. It derives the key from the ID and reduces it to one of 100 buckets:

JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
PYTHON - ord(id[0]) % 100
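The bucketing idea above can be sketched in Python as follows. Note that `ord(id[0]) % 100` looks only at the first character, so IDs that share a prefix all land in one bucket, and Python's built-in `hash()` for strings is randomized per process by default. This sketch therefore swaps in `zlib.crc32` as a stable hash; that choice, the function name `bucket_key`, and the constant `NUM_BUCKETS` are assumptions for illustration, not something taken from the book or the thread.

```python
import zlib

# Assumption: 100 buckets, matching the "% 100" in the quoted example.
NUM_BUCKETS = 100


def bucket_key(message_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a message ID to a bucket in [0, num_buckets).

    crc32 depends only on the bytes of the ID, so the same ID gets the
    same bucket in every process, unlike the built-in hash() for str.
    """
    return zlib.crc32(message_id.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    # Hypothetical IDs, only to show that different IDs share buckets.
    for mid in ["msg-1", "msg-2", "msg-3"]:
        print(mid, bucket_key(mid))
```

In a Beam Python pipeline, a function like this could presumably be applied with `beam.WithKeys(bucket_key)` ahead of `beam.GroupIntoBatches(100)`. The number of buckets bounds how many keys can be processed in parallel, so it is a tuning knob rather than a fixed value.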
Re: Any recommendation for key for GroupIntoBatches
How about just keeping track of a buffer, flushing it after 100 messages, and also flushing whatever is left in the buffer in finish_bundle?
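A minimal sketch of that buffering idea, written as a plain Python class so it runs without Beam installed. The class `BundleBatcher` and the helper `run_bundle` are hypothetical names; the methods only mirror the `start_bundle` / `process` / `finish_bundle` lifecycle of a Beam `DoFn`. In a real Python `DoFn`, `finish_bundle` would additionally have to wrap its output in a windowed value, which is left out here.

```python
class BundleBatcher:
    """Buffers elements and emits them in lists of at most batch_size."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []

    def start_bundle(self):
        # Start every bundle with an empty buffer.
        self.buffer = []

    def process(self, element):
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            yield self._flush()

    def finish_bundle(self):
        # Emit whatever is left so no element is dropped at bundle end.
        if self.buffer:
            yield self._flush()

    def _flush(self):
        batch, self.buffer = self.buffer, []
        return batch


def run_bundle(batcher, elements):
    """Drive one bundle through the batcher, the way a runner would."""
    batcher.start_bundle()
    out = []
    for element in elements:
        out.extend(batcher.process(element))
    out.extend(batcher.finish_bundle())
    return out


if __name__ == "__main__":
    batches = run_bundle(BundleBatcher(100), range(250))
    print([len(b) for b in batches])
```

With this approach no key is needed, but batch sizes depend on how the runner cuts bundles, so small bundles produce batches well under 100. The Python SDK also ships `beam.BatchElements`, which batches within a bundle in a similar spirit.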
Any recommendation for key for GroupIntoBatches
Hello guys,

Maybe this question was already answered, but I cannot find it and want some more input on this topic.

I have some messages that don't have any particular key candidate except the ID, but I don't want to use it because the idea is to group multiple IDs in the same batch.

This is my use case:

I have an endpoint to which I'm going to send the message ID; the endpoint returns certain information that I will use to enrich my message. To avoid calling the endpoint once per message, I want to batch the IDs in groups of 100 and send the 100 IDs in one request (the endpoint supports it). I was thinking of using GroupIntoBatches.

- If I choose the ID as the key, my understanding is that it won't work the way I want (because it will form batches of the same ID).
- Using a constant as the key would be a problem for parallelism, is that correct?

Then my question is, what should I use as a key? Maybe something based on the timestamp, so I can have groups of messages that arrive in a certain second?

Any suggestions would be appreciated.

Thanks.
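The two key choices in the bullets above can be compared with a small simulation. This is a simplified model of per-key batching, not Beam's implementation: it ignores windows, timers, and runners, and the names `batch_per_key` and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def batch_per_key(ids, key_fn, batch_size=100):
    """Group IDs by key, then cut each group into batches of at most batch_size."""
    groups = defaultdict(list)
    for message_id in ids:
        groups[key_fn(message_id)].append(message_id)
    return {
        key: [values[i:i + batch_size] for i in range(0, len(values), batch_size)]
        for key, values in groups.items()
    }


# Hypothetical input: 1000 unique message IDs.
ids = [f"msg-{n}" for n in range(1000)]

# Key = the ID itself: every key holds a single element, so no real batching.
by_id = batch_per_key(ids, key_fn=lambda message_id: message_id)

# Key = a constant: full batches, but everything funnels through one key.
by_const = batch_per_key(ids, key_fn=lambda message_id: 0)

if __name__ == "__main__":
    print(len(by_id), "keys when keyed by ID")
    print(len(by_const), "key when keyed by a constant")
```

In this model, keying by ID gives 1000 keys with one single-element batch each, while a constant key gives one key holding ten batches of 100. Since batches are formed per key, a constant key leaves nothing to spread across workers, which is the parallelism concern raised in the second bullet.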