Could you possibly use a side input with fixed interval triggering[1] to
query the Dataflow API for the most recent autoscaling log message, as
suggested here[2]?

[1]
https://beam.apache.org/documentation/patterns/side-inputs/
[2]
https://stackoverflow.com/a/54406878/6432284
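For what it's worth, once the worker count is available the arithmetic on each
worker is tiny. A minimal Python sketch of the idea, where `fetch_worker_count`
is just a stand-in for the Dataflow API query (the `jobs.messages` listing from
[2]) and the numbers are illustrative:

```python
def fetch_worker_count():
    # Placeholder: a real implementation would query the Dataflow API
    # (projects.locations.jobs.messages.list) for the latest autoscaling
    # event, as in the Stack Overflow answer linked in [2].
    return 50


class LocalRateLimit:
    """Derives a per-worker request rate from a global budget."""

    def __init__(self, global_qps):
        self.global_qps = global_qps
        self.local_qps = None
        self.refresh()

    def refresh(self):
        # Re-estimate periodically (e.g. whenever the side input refreshes
        # on its fixed interval) so autoscaling doesn't invalidate the limit.
        workers = max(1, fetch_worker_count())
        self.local_qps = self.global_qps / workers


limit = LocalRateLimit(global_qps=1000)
print(limit.local_qps)  # 1000 QPS / 50 workers -> 20.0 per worker
```

The refresh would naturally be driven by the periodically triggered side input,
so the per-worker limit tracks scaling events with some lag.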

On Thu, Apr 15, 2021 at 18:14 Daniel Thevessen <[email protected]> wrote:

> Hi folks,
>
> I've been working on a custom PTransform that makes requests to another
> service, and would like to add a rate limiting feature there. The
> fundamental issue that I'm running into here is that I need a decent
> heuristic to estimate the worker count, so that each worker can
> independently set a limit which globally comes out to the right value. All
> of this is easy if I know how many machines I have, but I'd like to use
> Dataflow's autoscaling, which would easily break any pre-configured value.
> I have seen two main approaches for rate limiting, both for a configurable
> variable x:
>
>    - Simply assume worker count is x, then divide by x to figure out the
>    "local" limit. The issue I have here is that if we assume x is 500, but it
>    is actually 50, I'm now paying for 50 nodes to throttle 10 times as much as
>    necessary. I know the pipeline options have a reference to the runner; is
>    it possible to get an approximate current worker count from that at bundle
>    start (*if* runner is DataflowRunner)?
>    - Add another PTransform in front of the API requests, which groups by
>    x number of keys, throttles, and keeps forwarding elements with an instant
>    trigger. I initially really liked this solution because even if x is
>    misconfigured, I will have at most x workers running and throttle
>    appropriately. However, I noticed that for batch pipelines, this
>    effectively also caps the API request stage at x workers. If I throw in a
>    `Reshuffle`, there is another GroupByKey (-> another stage), and nothing
>    gets done until every element has passed through the throttler.
>
> Has anyone here tried to figure out rate limiting with Beam before, and
> perhaps run into similar issues? I would love to know if there is a
> preferred solution to this type of problem.
> I know sharing state like that runs a little counter to the Beam pipeline
> paradigm, but really all I need is an approximate worker count with few
> guarantees.
>
> Cheers,
> Daniel
>
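Once each worker knows its share of the global budget (global rate divided by
the estimated worker count), the throttling itself can be a simple token bucket
run inside the DoFn. A stdlib-only sketch, with illustrative numbers:

```python
import time


class TokenBucket:
    """Allows `rate` requests per second with a burst of `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        # Block until a token is available, refilling based on elapsed time.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)


bucket = TokenBucket(rate=20.0, capacity=5)
for _ in range(3):
    bucket.acquire()  # each outbound API request would follow this call
```

This only enforces the limit correctly if the per-worker rate is refreshed as
the worker count changes, which is where the periodic side input above comes in.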
