Hello team, I am using Beam Java SDK 2.23.0 on Dataflow.

I have a pipeline that continuously reads from Kafka, runs the data through various transforms, and finally emits output to various sinks such as Elasticsearch, BigQuery, and GCS buckets. A few of the transforms maintain state for the input KV pairs (using Beam's stateful processing model) and update the incoming data based on metadata kept in that state.

The use case is that each of those KV pairs belongs to some user activity, and we don't know in advance how long that activity will last; it can run from a few hours to a few days. If Beam runs into an issue in the meantime, or if we need to restart the job for maintenance, we should be able to retain the state.

To accommodate this without maintaining a global window, I use a fixed window of 24 hours and checkpoint at window expiry by writing the state to a BigQuery table. In a new window, if the code doesn't find state for a key, it reads the checkpoint back from BigQuery and reloads the state, so the state is maintained again in the new window.

This implementation works fine as long as there are only a few keys to handle. Under heavy user traffic, which produces a lot of KV pairs, the start of a new window can trigger a large number of parallel BigQuery reads of the checkpointed state. Since other modules in our system (cloud functions etc.) also keep reading from BigQuery, under heavy load Dataflow sometimes receives the following exception while reading state/checkpoint from BigQuery:

exception: "com.google.cloud.bigquery.BigQueryException: Job exceeded rate limits: Your project_and_region exceeded

We have asked the GCP folks to increase the limit, but according to them that is not good for performance and also increases cost.

My question is: *Is my above approach of checkpointing with BigQuery correct? Can someone suggest a better approach for checkpointing on Dataflow?*

Thanks and Regards,
Mohil
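Edit: to make the flow concrete, here is a minimal sketch of the per-key checkpoint/reload logic described above, with the Beam state API and the BigQuery client replaced by in-memory stubs (`windowState` and `checkpointTable` are hypothetical stand-ins, not real Beam or BigQuery calls), so only the control flow is shown:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the checkpoint-at-expiry / reload-on-miss pattern described
// above. In the real pipeline, windowState would be a per-key Beam
// ValueState and checkpointTable would be a BigQuery table.
public class CheckpointSketch {

    // Stand-in for the BigQuery checkpoint table (hypothetical).
    static final Map<String, String> checkpointTable = new HashMap<>();

    // Stand-in for Beam's per-window, per-key state (hypothetical).
    static final Map<String, String> windowState = new HashMap<>();

    // Called per element: if no state exists for the key in the current
    // window, fall back to the checkpoint, then update the state.
    static String process(String key, String value) {
        String state = windowState.get(key);
        if (state == null) {
            // Reload from the checkpoint written at the previous window's
            // expiry. This is the read that bursts at the start of a new
            // window and hits BigQuery rate limits under heavy traffic.
            state = checkpointTable.getOrDefault(key, "");
        }
        String updated = state + value;
        windowState.put(key, updated);
        return updated;
    }

    // Called at window expiry: persist every key's state, then clear it.
    static void onWindowExpiration() {
        checkpointTable.putAll(windowState);
        windowState.clear();
    }
}
```

In the actual DoFn the expiry step corresponds to an `@OnWindowExpiration` callback that writes the state out before the window is garbage-collected.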
