Hi Joey,
the fanout parameter accepts two kinds of values (besides None):
a) a constant int
b) a function mapping each key to an int
Both forms boil down to a mapping from key to integer, where the
integer says how many intermediate copies (or "buckets") there should
be for that key. If a key has more than one bucket, a random bucket is
chosen for each incoming key-value pair with that key, and the pair is
pre-combined within that bucket. The per-bucket results are then
combined again to produce the final value for the key. One consequence
is that if the fanout is too high for a rare key, there may be no
pre-combining at all (every instance of that key lands in its own
bucket), which only increases shuffling and lowers efficiency. See the
sketch below.
Best,
Jan
On 1/16/25 22:01, Joey Tran wrote:
Hi,
I've read the documentation for CombinePerKey with hot key fanout and
I think I understand it at a high level (split up and combine sharded
keys before merging all values in one key) but I'm confused by the
parameter that this method takes and how it affects the behavior of
the transform.
Is the fanout parameter the... number of shards per key? Am I thinking
about this right? Any help would be appreciated, thanks!
Joey
[1]
https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey.with_hot_key_fanout