Ho Joey,

the fanout parameter takes two types of parameters (besides None):

 a) int constant

 b) function mapping a key to int

Both types result in a mapping from key to integer, where this integer signals how many downstream copies (or "buckets") there should be for a given key. If there is more than 1 bucket for given key, then a random bucket is chosen for given input key-value pair and the pair is sent for processing to the selected "bucket". These buckets are then combined together to produce final value per key. This has the impact that if the "fanout" is too high for a scarce key, then it might result in no combining at all (all instances of the specific key will end-up in own bucket). This will result in increased shuffling and lower efficiency.

Best,

 Jan

On 1/16/25 22:01, Joey Tran wrote:
Hi,

I've read the documentation for CombinePerKey with hot key fanout and I think I understand it at a high level (split up and combine sharded keys before merging all values in one key) but I'm confused by the parameter that this method takes and how it affects the behavior of the transform.

Is the fanout parameter the... number of shards per key? Am I thinking about this right? Any help would be appreciated, thanks!

Joey

[1] https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey.with_hot_key_fanout

Reply via email to