Hi, Arjun.
IMO, if the key "customer_id" is not very sparse, using "customer_id" directly as the key is fine: it takes full advantage of concurrency and still lets you rescale the parallelism above 10. Otherwise, using "customer_id mod 10" as the key works around the potential problem of sparse keys, but the disadvantage is that the effective concurrency is then limited to 10. There is a similar idea in Flink; you can refer to the following doc [1] for more details. A rough sketch of the two keying variants is appended below the quoted mail.

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/#split-distinct-aggregation

--
Best!
Xuyang

At 2023-12-04 23:24:34, "arjun s" <arjunjoice...@gmail.com> wrote:

Hello team,

I'm currently working on a Flink use case where I need to calculate the sum of occurrences for each "customer_id" within a 10-minute duration and send the results to Kafka, associating each "customer_id" with its corresponding count (e.g., 101:5). In this scenario, my data source is a file, and I'm creating a keyed data stream. With approximately one million entries in the file, I'm uncertain about the optimal keying strategy. Specifically, I'm trying to decide whether to use each customer_id directly as the key or to use the modulus of 10 of the customer_id as the key.

Could you please provide guidance on which approach would yield better performance?

Thanks and regards,
Arjun
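
P.S. For reference, here is a minimal, untested sketch of the direct-key variant with the DataStream API. It assumes roughly Flink 1.17, a plain text file with one customer_id per line, a processing-time window, and a hypothetical Kafka topic "customer-counts" on localhost:9092; please adjust to your setup.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CustomerCountJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: one customer_id per line, e.g. "101".
        DataStream<String> lines = env.readTextFile("/path/to/customers.txt");

        DataStream<Tuple2<Long, Long>> counts = lines
                .map(line -> Tuple2.of(Long.parseLong(line.trim()), 1L))
                .returns(Types.TUPLE(Types.LONG, Types.LONG))
                // Variant A: key directly by customer_id -> state and work are spread
                // over all parallel subtasks and keep scaling beyond parallelism 10.
                .keyBy(t -> t.f0)
                // Variant B would be .keyBy(t -> t.f0 % 10); note that sum(1) would then
                // count per bucket rather than per customer, so the window function would
                // have to keep a per-customer map inside each bucket instead.
                .window(TumblingProcessingTimeWindows.of(Time.minutes(10)))
                // Sum the "1"s -> occurrences per customer_id in each 10-minute window.
                .sum(1);

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("customer-counts")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        // Emit results in the "customer_id:count" shape, e.g. "101:5".
        counts.map(t -> t.f0 + ":" + t.f1)
              .returns(Types.STRING)
              .sinkTo(sink);

        env.execute("customer-id window count");
    }
}

If you do end up bucketing the keys, the two-level (partial then final) aggregation described in [1] is the usual way to keep per-customer results while avoiding hot keys.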