We have a Beam/Dataflow streaming pipeline that reads from Pub/Sub. As the first step in the pipeline we apply DeduplicatePerKey to the incoming records, keyed on a record key, with an event_time_duration of 4 hours. We notice that this step's wall time grows enormously (into days), and the Dataflow runner autoscales aggressively: it quickly reaches 100 workers and then stays in the 60-100 range. We tried reducing the dedup window to 30 minutes (and also tried 1 hour), but the wall time is still measured in days and the worker count stays in the 60-100 range. At the end of each window the worker count drops, but not by much. We see high sustained CPU (over 90%) on all workers and low memory usage. All of this happens with just over 20M Pub/Sub messages read. Is DeduplicatePerKey just an expensive operation, or is this due to the window size?
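For context, here is a minimal sketch of our setup (SUBSCRIPTION, extract_key, and the step names are placeholders for our actual values):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.deduplicate import DeduplicatePerKey
from apache_beam.utils.timestamp import Duration

pipeline_options = PipelineOptions(streaming=True)

with beam.Pipeline(options=pipeline_options) as p:
  deduped = (
      p
      # Unbounded read from our Pub/Sub subscription (placeholder name).
      | 'ReadPubSub' >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
      # Key each message by the field we dedup on (extract_key is ours).
      | 'KeyByDedupKey' >> beam.Map(lambda msg: (extract_key(msg), msg))
      # First real step: drop repeats of a key seen in the last 4 hours.
      | 'Dedup4h' >> DeduplicatePerKey(
          event_time_duration=Duration(seconds=4 * 60 * 60)))
  # ... downstream transforms consume `deduped` ...
```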
Also, we can't determine whether the autoscaling is driven by the increased wall time / data freshness, by workers being pinned to hold state (so the same workers can't be used for other transforms), or by the dedup itself being very costly. There isn't much API documentation on how DeduplicatePerKey works. Our assumption is that it passes through the first event for a key and stores that key at the same time, so that if another event with the same key arrives within event_time_duration, the new event is dropped and not passed through. We're not sure why that lookup would be so expensive.
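To make our assumption concrete, here is a rough sketch of the mechanism we imagine, written as a stateful DoFn (this is our illustration, not Beam's actual implementation; NaiveDedupFn is a hypothetical name):

```python
import apache_beam as beam
from apache_beam.coders import BooleanCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)


class NaiveDedupFn(beam.DoFn):
  """Sketch of the dedup behavior we assume: remember that a key was
  seen, drop repeats, and clear the state once the horizon passes."""
  SEEN = ReadModifyWriteStateSpec('seen', BooleanCoder())
  EXPIRY = TimerSpec('expiry', TimeDomain.WATERMARK)

  def __init__(self, event_time_duration):
    self.event_time_duration = event_time_duration

  def process(self,
              element,  # (key, value); state below is scoped per key
              timestamp=beam.DoFn.TimestampParam,
              seen=beam.DoFn.StateParam(SEEN),
              expiry=beam.DoFn.TimerParam(EXPIRY)):
    if seen.read():
      return  # duplicate within the dedup horizon: drop it
    seen.write(True)
    # Clear the per-key state once the watermark passes the horizon.
    expiry.set(timestamp + self.event_time_duration)
    yield element

  @on_timer(EXPIRY)
  def expire(self, seen=beam.DoFn.StateParam(SEEN)):
    seen.clear()
```

Under this model each element should cost roughly one state read, one state write, and one timer set, which is why the sustained 90%+ CPU surprises us.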