Hello Luke,

Thanks for your reply.

1. Yes, we do use the update option for updating our pipeline, and by doing so we don't worry about losing state since we don't need to restart the Beam job. However, if we are making major algorithm changes, or changes to things like windowing or triggers, then we have to restart the pipeline in order to avoid unpredictable results. But yes, I agree that most of the time the update option works fine.
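For anyone following this thread: the in-place update mentioned above amounts to relaunching the same pipeline with the --update flag and the existing job name (a sketch of the launch flags only; the main class, project, region, and job name below are placeholders, not from this thread):

```shell
# Relaunch the pipeline binary with --update and the existing --jobName.
# Dataflow runs a compatibility check and carries intermediate state over
# to the replacement job. All names below are placeholders.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyKafkaPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-gcp-project \
    --region=us-central1 \
    --jobName=my-streaming-job \
    --update"
```

If transforms were renamed between versions, Dataflow's `--transformNameMapping` option can map old step names to new ones; incompatible changes (e.g. to windowing or triggers, as noted above) still require a full restart.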
2. I also use session windowing in some of my transforms, and I could use it here as well. However, as I mentioned, there is no deterministic inactivity (gap) period that can delimit a session. I would have to pick a very large gap period after which a new session is guaranteed to start, and even that would not solve the problem when we have to restart Beam (i.e., when --update can't be used).

3. Yes, you are correct, it's a simple KV put and get operation for checkpointing. We have been using BigQuery in our system, so we just decided to leverage that, but I can certainly explore Bigtable or other options. Thanks for suggesting these.

Thanks and Regards
Mohil

On Mon, Sep 21, 2020 at 9:37 AM Luke Cwik <[email protected]> wrote:

> Have you tried doing a pipeline update with Dataflow[1]? If yes, what
> prevented you from using this all the time?
>
> Some users have been able to use session windows to solve problems in this
> space, have you explored this[2]? If yes, what didn't work for you?
>
> It seems as though you are doing a bunch of NoSQL-like queries (e.g. save
> data for key X, load data for key Y) and not doing complex aggregations or
> needing multi-key transactions. Have you considered using a datastore that
> is designed for this (e.g. Cloud Bigtable/Cloud Firestore/...)[3]?
>
> 1: https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline
> 2: https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
> 3: https://cloud.google.com/products/databases
>
> On Sun, Sep 20, 2020 at 7:53 PM Mohil Khare <[email protected]> wrote:
>
>> Hello team,
>>
>> I am using Beam Java SDK 2.23.0 on Dataflow.
>>
>> I have a pipeline where I continuously read from Kafka, run through
>> various transforms, and finally emit output to various sinks like Elastic
>> Search, BigQuery, GCS buckets, etc.
>>
>> There are a few transforms where I maintain state for input KV pairs
>> (using the stateful Beam model) and update the incoming data based on
>> metadata maintained in state.
>>
>> The use case is that each of those KV pairs belongs to some user
>> activity, and we don't know how long this activity will last. It can run
>> from a few hours to a few days. If in between Beam runs into some issue,
>> or we need to restart the Beam job for maintenance, we should be able to
>> retain the state.
>>
>> To accommodate the above requirement without maintaining a global
>> window, I maintain a fixed window of 24 hours, and at window expiry I do
>> checkpointing by writing the state into a BigQuery table. In a new
>> window, if the code doesn't find state for a key, it does a BigQuery
>> read and reloads the state so that state is again maintained for the new
>> window.
>>
>> My above implementation works fine as long as we have few keys to
>> handle. Heavy user traffic results in a lot of KV pairs, which sometimes
>> leads to many parallel BigQuery reads of state at the start of a new
>> window. Since our system also has other modules, cloud functions, etc.
>> that keep reading from BigQuery, under heavy load Dataflow sometimes
>> receives the following exception while reading state/checkpoints from
>> BigQuery:
>>
>> exception: "com.google.cloud.bigquery.BigQueryException: Job exceeded
>> rate limits: Your project_and_region exceeded
>>
>> We have asked GCP folks to increase the limit, but according to them
>> that is not good for performance and also increases cost.
>>
>> My question is:
>> *Is my above approach of checkpointing with BigQuery correct? Can
>> someone suggest a better approach for checkpointing in the case of
>> Dataflow?*
>>
>> Thanks and Regards
>> Mohil
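P.S. As discussed above, the checkpointing is essentially a simple KV put (at window expiry) and get (at the start of a new window). A minimal self-contained sketch of that abstraction, with all names hypothetical and an in-memory map standing in for the real BigQuery (or Bigtable) backing store:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the checkpoint-store abstraction discussed in this
// thread: flushing state at window expiry is a put(), reloading state when a
// new window sees an unknown key is a get(). The real implementation writes
// to BigQuery; the same interface could be backed by Cloud Bigtable instead.
interface CheckpointStore<K, V> {
  void put(K key, V state);   // persist state when the fixed window expires
  Optional<V> get(K key);     // reload state on the first element of a new window
}

// In-memory stand-in, used here only to keep the sketch self-contained.
class InMemoryCheckpointStore<K, V> implements CheckpointStore<K, V> {
  private final Map<K, V> backing = new HashMap<>();

  @Override
  public void put(K key, V state) {
    backing.put(key, state);
  }

  @Override
  public Optional<V> get(K key) {
    return Optional.ofNullable(backing.get(key));
  }
}

public class CheckpointSketch {
  public static void main(String[] args) {
    CheckpointStore<String, Long> store = new InMemoryCheckpointStore<>();
    store.put("user-activity-42", 128L);             // flush at window expiry
    System.out.println(store.get("user-activity-42") // reload in the new window
        .orElse(-1L));
    System.out.println(store.get("unknown-key").isPresent());
  }
}
```

Because the interface is this narrow, swapping the backing store (as Luke suggests) only changes the implementation of put/get, not the pipeline code around it.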
