I think I'm still struggling a bit... Let's stick with a bounded example for now. I would be reading from a single Mongo cluster/database/collection/partition that has billions of documents in it. I read through the MongoDbIO code a bit and it seems to:
1. Get the min ID
2. Get the max ID
3. Split that range by bundle size
4. Read those ranges

In the case of billions, I imagine that would be a large number of splits. My
(possibly incorrect) understanding is that Beam would try to go parallel here.
Ideally I would have some way to say "only read x bundles at once." (I've put
some rough sketches of what I've been looking at below the quoted thread.)

Thanks!

On Wed, Apr 9, 2025 at 5:32 AM Radek Stankiewicz <radosl...@google.com> wrote:

> hey Jonathan,
>
> Parallelism for reads and writes is directly related to the number of keys
> you are processing in the current stage.
> As an example, imagine you have KafkaIO with one partition, and after
> reading from KafkaIO you have a step mapping to a JDBC entity and then a
> step writing to the database.
> If you keep all of those steps in the same stage, you will have a
> single-threaded write to the database.
> If you add a Redistribute transform with numKeys set to N, you may have up
> to N parallel writes. More and more IOs are leveraging this transform to
> control parallelism.
> Rate limiting is per worker; if you have a backlog and unlimited keys, the
> runner may decide to add more workers, which will eventually increase the
> number of calls.
>
> If you are using Java and JdbcIO, you can specify the number of splits by
> setting numPartitions and specifying the partitioning column.
>
> What I would recommend investigating is connection pooling, as this is
> also problematic with operational databases. For JDBC, HikariCP is pretty
> easy to integrate.
>
> On Thu, Apr 3, 2025 at 12:32 AM Jonathan Hope <
> jonathan.douglas.h...@gmail.com> wrote:
>
>> Hello all,
>>
>> The pipeline I'm working on will be run against two different databases
>> that are both online, so I will have to control the amount of data being
>> read/written to maintain QoS. To do so I would have to control the number
>> of workers executing in parallel for a given step.
>>
>> I saw this issue on GitHub: https://github.com/apache/beam/issues/17835.
>> However, it has been closed and I don't see any related PRs/comments/etc.
>> Did that work get done, or was it cancelled?
>>
>> I also saw this post:
>> https://medium.com/art-of-data-engineering/steady-as-she-flows-rate-limiting-in-apache-beam-pipelines-42cab0b7f31d.
>> That would definitely work, but it would result in workers spinning up and
>> waiting on locks (which costs money).
>>
>> Perhaps this is more of a runner concern, though, and I did see a way to
>> limit the maximum number of workers here:
>> https://cloud.google.com/dataflow/docs/reference/pipeline-options. I
>> believe that would also work, but then the max number of workers would be
>> dictated by whichever step has the highest performance cost, which would
>> result in the pipelines being slower overall.
>>
>> Thanks!
>>
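
P.S. For concreteness, here are the rough sketches I mentioned above. All
connection details, names, and values are placeholders, and this is just my
reading of the APIs, so please correct me if I've got any of it wrong.

For the MongoDbIO case, withNumSplits seems to be the knob for how many ID
ranges the collection is split into; as far as I can tell it does not cap how
many of those bundles the runner processes at once, which is the part I'm
still unsure about:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.mongodb.MongoDbIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;
    import org.bson.Document;

    public class MongoReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder connection details. withNumSplits bounds the number of
        // ID ranges the collection is split into, not the runner's concurrency.
        PCollection<Document> docs =
            p.apply(
                MongoDbIO.read()
                    .withUri("mongodb://localhost:27017")
                    .withDatabase("my_db")
                    .withCollection("my_collection")
                    .withNumSplits(100));
        // Downstream transforms would go here.

        p.run().waitUntilFinish();
      }
    }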
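
To make sure I follow the point about keying the write stage: Radek mentions
the newer Redistribute transform with numKeys; the sketch below uses
Reshuffle.viaRandomKey().withNumBuckets(N), which is the older equivalent I'm
more familiar with, with a stand-in DoFn for the rate-sensitive write:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;

    public class LimitedFanoutSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(Create.of("a", "b", "c"))
            // Spread elements across at most 4 keys so the fused write stage
            // downstream runs with at most ~4 parallel bundles.
            .apply(Reshuffle.<String>viaRandomKey().withNumBuckets(4))
            .apply(
                ParDo.of(
                    new DoFn<String, Void>() {
                      @ProcessElement
                      public void processElement(@Element String row) {
                        // Placeholder for the real rate-sensitive database write.
                      }
                    }));

        p.run().waitUntilFinish();
      }
    }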
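
For the JdbcIO suggestion about numPartitions and the partitioning column,
this is roughly what I understand readWithPartitions to look like (driver,
URL, table, and column names are made up):

    import java.sql.ResultSet;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class JdbcPartitionedReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> names =
            p.apply(
                JdbcIO.<String>readWithPartitions()
                    .withDataSourceConfiguration(
                        JdbcIO.DataSourceConfiguration.create(
                                "org.postgresql.Driver",
                                "jdbc:postgresql://localhost:5432/my_db")
                            .withUsername("user")
                            .withPassword("password"))
                    .withTable("my_table")
                    // Column used to compute the split ranges.
                    .withPartitionColumn("id")
                    // Number of ranges the table is split into.
                    .withNumPartitions(10)
                    .withRowMapper(
                        new JdbcIO.RowMapper<String>() {
                          @Override
                          public String mapRow(ResultSet resultSet) throws Exception {
                            return resultSet.getString("name");
                          }
                        }));

        p.run().waitUntilFinish();
      }
    }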
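
And for the HikariCP suggestion, the pattern I've seen (assuming I've
understood withDataSourceProviderFn correctly) is to hand JdbcIO a provider
function that lazily builds one pooled DataSource per worker, e.g.
JdbcIO.<T>write().withDataSourceProviderFn(PooledDataSourceSketch.providerFn()):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import javax.sql.DataSource;
    import org.apache.beam.sdk.transforms.SerializableFunction;

    public class PooledDataSourceSketch {

      // Lazily created once per worker JVM and shared across bundles.
      private static HikariDataSource pool;

      private static synchronized DataSource getPool() {
        if (pool == null) {
          HikariConfig config = new HikariConfig();
          config.setJdbcUrl("jdbc:postgresql://localhost:5432/my_db"); // placeholder
          config.setUsername("user");                                  // placeholder
          config.setPassword("password");                              // placeholder
          config.setMaximumPoolSize(8); // caps connections per worker
          pool = new HikariDataSource(config);
        }
        return pool;
      }

      // Pass this to JdbcIO via withDataSourceProviderFn(...).
      public static SerializableFunction<Void, DataSource> providerFn() {
        return ignored -> getPool();
      }
    }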
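
Lastly, the Dataflow option I was referring to in my original message is
maxNumWorkers, e.g.:

    --runner=DataflowRunner --maxNumWorkers=10

which, as I understand it, caps autoscaling for the whole job rather than for
a single step, which is why it felt like a blunt instrument for this.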