Hi, Arjun.

IMO, if the key "customer_id" is not very sparse, using "customer_id" directly 
as the key is fine: it takes full advantage of concurrency and keeps working if 
you later rescale the parallelism above 10. Otherwise, I think using 
"customer_id mod 10" as the key is a reasonable way to work around sparse keys, 
but the disadvantage is that the effective concurrency is limited to 10. A 
minimal sketch of both options follows.


There is a similar idea in Flink SQL (split distinct aggregation); you can 
refer to doc [1] below for more details, and a minimal config sketch after the 
link.


[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/#split-distinct-aggregation
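
If you ever move this to the Table/SQL API, the technique in [1] is enabled by 
a single planner option. A minimal sketch, assuming Flink 1.15+ (only the 
option key below is taken from the linked page; the rest is generic 
boilerplate):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SplitDistinctAggSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Option key from doc [1]: lets the planner automatically split a
        // skewed distinct aggregation into a partial aggregation on a
        // bucketed key plus a final aggregation -- the same spirit as the
        // manual "customer_id mod 10" bucketing above.
        tEnv.getConfig().set("table.optimizer.distinct-agg.split.enabled", "true");
    }
}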

--

    Best!
    Xuyang


At 2023-12-04 23:24:34, "arjun s" <arjunjoice...@gmail.com> wrote:

Hello team,

I'm currently working on a Flink use case where I need to count the 
occurrences of each "customer_id" within a 10-minute window and send the 
results to Kafka, associating each "customer_id" with its corresponding count 
(e.g., 101:5).

In this scenario, my data source is a file, and I'm creating a keyed data 
stream. With approximately one million entries in the file, I'm uncertain about 
the optimal keying strategy. Specifically, I'm trying to decide whether to use 
each customer_id directly as the key or to use customer_id modulo 10 as the 
key. Could you please provide guidance on which approach
would yield better performance?

Thanks and regards,
Arjun
