Hello team,

I am using the Beam Java SDK 2.23.0 on Dataflow.

I have a pipeline that continuously reads from Kafka, runs the data through
various transforms, and finally emits output to several sinks such as
Elasticsearch, BigQuery, and GCS buckets.

There are a few transforms where I maintain state for the input KV pairs
(using the stateful Beam model) and update the incoming data based on
metadata held in that state.
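For context, a minimal sketch of that stateful pattern, assuming
KV<String, String> elements and a string-valued state cell (the names
EnrichFn and "metadata" are illustrative, not my actual code):

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class EnrichFn extends DoFn<KV<String, String>, KV<String, String>> {

  // Per-key, per-window state cell holding the metadata for this key.
  @StateId("metadata")
  private final StateSpec<ValueState<String>> metadataSpec =
      StateSpecs.value(StringUtf8Coder.of());

  @ProcessElement
  public void process(
      @Element KV<String, String> element,
      @StateId("metadata") ValueState<String> metadata,
      OutputReceiver<KV<String, String>> out) {
    String meta = metadata.read();
    // ... update the incoming element based on the stored metadata ...
    metadata.write(meta);   // persist the (possibly updated) state
    out.output(element);
  }
}
```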

Now, the use case is that each of those KV pairs belongs to some user
activity, and we don't know how long that activity will last; it can run
from a few hours to a few days. If Beam runs into an issue in between, or we
need to restart the Beam job for maintenance, we should be able to retain
the state.

To accommodate this requirement without maintaining a global window, I use a
fixed window of 24 hours and, at window expiry, checkpoint by writing the
state to a BigQuery table. Then, in a new window, if the code doesn't find
state for a key, it reads from BigQuery and reloads the state, so the state
is maintained again for the new window.
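Sketched as code, the checkpoint/reload cycle looks roughly like this
(readCheckpointFromBigQuery and writeCheckpointToBigQuery are hypothetical
helpers standing in for my BigQuery client calls; state and element types
are simplified):

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class CheckpointingFn extends DoFn<KV<String, String>, KV<String, String>> {

  @StateId("activity")
  private final StateSpec<ValueState<String>> activitySpec =
      StateSpecs.value(StringUtf8Coder.of());

  @ProcessElement
  public void process(
      @Element KV<String, String> element,
      @StateId("activity") ValueState<String> activity,
      OutputReceiver<KV<String, String>> out) {
    String state = activity.read();
    if (state == null) {
      // First element for this key in the new 24h window: reload the
      // state checkpointed by the previous window (or a previous job).
      state = readCheckpointFromBigQuery(element.getKey());  // hypothetical
    }
    // ... update state from the element and enrich the output ...
    activity.write(state);
    out.output(element);
  }

  @OnWindowExpiration
  public void onExpiry(@StateId("activity") ValueState<String> activity) {
    // The 24h window is closing: persist the state so the next window
    // (or a restarted job) can pick it up again.
    writeCheckpointToBigQuery(activity.read());  // hypothetical
  }

  private String readCheckpointFromBigQuery(String key) { /* BigQuery read */ return ""; }
  private void writeCheckpointToBigQuery(String state) { /* BigQuery write */ }
}
```

It is exactly the per-key reads in the `state == null` branch that fan out
in parallel at the start of each window and trip the rate limit below.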

The above implementation works fine as long as we have only a few keys to
handle. Heavy user traffic, however, produces a lot of KV pairs, which
sometimes leads to many parallel BigQuery reads of state at the start of a
new window. Since other modules in our system (cloud functions etc.) also
keep reading from BigQuery, under heavy load Dataflow sometimes receives the
following exception while reading state/checkpoints from BigQuery:

exception: "com.google.cloud.bigquery.BigQueryException: Job exceeded rate
limits: Your project_and_region exceeded

We have asked the GCP folks to increase the limit, but according to them
that is not good for performance and also increases cost.

My question is:
*Is my above approach of checkpointing with BigQuery correct? Can someone
suggest a better approach for checkpointing on Dataflow?*

Thanks and Regards
Mohil
