Re: Any recommendation for a key for GroupIntoBatches

2024-04-12 Thread Ruben Vargas
On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim  wrote:
>
> Here is an example from a book that I'm reading now and it may be applicable.
>
> JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
> PYTHON - ord(id[0]) % 100

Maybe this is what I'm looking for. I'll give it a try. Thanks!
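
Something along these lines is what I'll try (Python SDK; a rough sketch,
where `messages` is my input PCollection and `call_enrichment_endpoint`
stands in for the real endpoint call):

import apache_beam as beam

NUM_SHARDS = 100   # number of distinct keys, for parallelism
BATCH_SIZE = 100   # IDs per request to the endpoint

def key_by_shard(message_id):
    # Derive a small, roughly uniform key from the ID (as suggested above),
    # so each batch mixes many different IDs instead of grouping equal IDs.
    return (ord(message_id[0]) % NUM_SHARDS, message_id)

batched = (
    messages
    | 'ExtractId' >> beam.Map(lambda msg: msg['id'])   # assumes a dict with an 'id' field
    | 'KeyByShard' >> beam.Map(key_by_shard)
    | 'Batch' >> beam.GroupIntoBatches(BATCH_SIZE)     # yields (shard_key, [up to 100 ids])
    # max_buffering_duration_secs can be passed to flush incomplete batches
    | 'Enrich' >> beam.Map(lambda kv: call_enrichment_endpoint(kv[1])))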

>
> On Sat, 13 Apr 2024 at 06:12, George Dekermenjian  wrote:
>>
>> How about just keeping track of a buffer, flushing it after 100 messages,
>> and also flushing any remaining buffer in finish_bundle?
>>
>>

If this buffer is in memory, it could lead to potential data loss. That is
why state is used, or at least that is my understanding. But maybe there is
a way to do this using state?
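
I was imagining something like the stateful DoFn pattern, where the buffer
lives in Beam-managed state instead of plain worker memory (Python SDK; a
rough sketch, names are illustrative, and the input must already be keyed,
e.g. by the shard key above):

import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.transforms.userstate import BagStateSpec

class StatefulBatchIds(beam.DoFn):
    BATCH_SIZE = 100
    BUFFER = BagStateSpec('buffer', StrUtf8Coder())  # durable, per-key buffer

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        _, message_id = element      # element is (shard_key, message_id)
        buffer.add(message_id)
        ids = list(buffer.read())    # re-reading the bag each time is simple but not cheap
        if len(ids) >= self.BATCH_SIZE:
            buffer.clear()
            yield ids                # one batch of 100 IDs
    # A timer could flush leftovers that never reach 100; GroupIntoBatches
    # already wraps this same state-and-timers pattern, including that flush.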


Re: Any recommendation for a key for GroupIntoBatches

2024-04-12 Thread Jaehyeon Kim
Here is an example from a book that I'm reading now, and it may be applicable
(the & Integer.MAX_VALUE mask just keeps the Java key non-negative, since
hashCode() can return negative values).

JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100
PYTHON - ord(id[0]) % 100


Re: Any recommendation for a key for GroupIntoBatches

2024-04-12 Thread George Dekermenjian
How about just keeping track of a buffer, flushing it after 100 messages,
and also flushing any remaining buffer in finish_bundle?
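
Something like this (Python SDK; a rough sketch, the class name is just
illustrative):

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.utils.windowed_value import WindowedValue

class BatchPerBundle(beam.DoFn):
    def __init__(self, batch_size=100):
        self._batch_size = batch_size

    def start_bundle(self):
        self._buffer = []

    def process(self, message_id):
        self._buffer.append(message_id)
        if len(self._buffer) >= self._batch_size:
            yield self._buffer
            self._buffer = []

    def finish_bundle(self):
        # Flush whatever is left when the runner closes the bundle.
        # finish_bundle must emit WindowedValues explicitly.
        if self._buffer:
            yield WindowedValue(self._buffer,
                                window.GlobalWindow().max_timestamp(),
                                [window.GlobalWindow()])
            self._buffer = []

The built-in beam.BatchElements transform does essentially this bundle-level
buffering, if you prefer not to hand-roll it, though in streaming the bundles
can be small, so batches are not guaranteed to reach 100.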


Any recommendation for a key for GroupIntoBatches

2024-04-12 Thread Ruben Vargas
Hello guys

Maybe this question was already answered, but I cannot find it, and I want
some more input on this topic.

I have some messages that don't have any obvious key candidate except the
ID, but I don't want to use the ID because the idea is to group multiple
IDs in the same batch.

This is my use case:

I have an endpoint to which I'm going to send the message ID, and this
endpoint returns certain information that I will use to enrich my message.
To avoid calling the endpoint once per message, I want to batch the messages
in groups of 100 and send the 100 IDs in one request (the endpoint supports
it). I was thinking of using GroupIntoBatches.

- If I choose the ID as the key, my understanding is that it won't work the
way I want (because it will form batches of the same ID).
- Using a constant key would be a problem for parallelism; is that correct?

Then my question is: what should I use as a key? Maybe something based on
the timestamp, so I can group messages that arrive within the same second?

Any suggestions would be appreciated

Thanks.