Hello Luke,

Thanks for your reply.

1. Yes, we do use the update option for updating our pipeline, and by doing so we don't worry about losing state since we don't need to restart the Beam job. However, if we are making major algorithm changes, or changes to things like windowing or triggers, then we have to restart the pipeline in order to avoid unpredictable results. But yes, I agree that most of the time the update option works fine.
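For anyone following this thread: the in-place update mentioned above amounts to relaunching the same pipeline with the --update flag and the existing job name (a sketch of the launch flags only; the main class, project, region, and job name below are placeholders, not from this thread):

```shell
# Relaunch the pipeline binary with --update and the existing --jobName.
# Dataflow runs a compatibility check and carries intermediate state over
# to the replacement job. All names below are placeholders.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyKafkaPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-gcp-project \
    --region=us-central1 \
    --jobName=my-streaming-job \
    --update"
```

If transforms were renamed between versions, Dataflow's `--transformNameMapping` option can map old step names to new ones; incompatible changes (e.g. to windowing or triggers, as noted above) still require a full restart.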
2. I also use session windowing in some of my transforms, and I could use it here as well. However, as I mentioned, there is no deterministic inactivity (gap) period that can delimit a session. I would have to pick a very large gap period after which a new session is guaranteed to start, and even that would not solve the problem when we have to restart Beam (i.e., when --update can't be used).

3. Yes, you are correct, it's a simple KV put and get operation for checkpointing. We have been using BigQuery in our system, so we just decided to leverage that, but I can certainly explore Bigtable or other options. Thanks for suggesting these.

Thanks and Regards
Mohil

On Mon, Sep 21, 2020 at 9:37 AM Luke Cwik <[email protected]> wrote:

> Have you tried doing a pipeline update with Dataflow[1]? If yes, what
> prevented you from using this all the time?
>
> Some users have been able to use session windows to solve problems in this
> space, have you explored this[2]? If yes, what didn't work for you?
>
> It seems as though you are doing a bunch of NoSQL-like queries (e.g. save
> data for key X, load data for key Y) and not doing complex aggregations or
> needing multi-key transactions. Have you considered using a datastore that
> is designed for this (e.g. Cloud Bigtable/Cloud Firestore/...)[3]?
>
> 1: https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline
> 2: https://beam.apache.org/documentation/programming-guide/#provided-windowing-functions
> 3: https://cloud.google.com/products/databases
>
> On Sun, Sep 20, 2020 at 7:53 PM Mohil Khare <[email protected]> wrote:
>
>> Hello team,
>>
>> I am using Beam Java SDK 2.23.0 on Dataflow.
>>
>> I have a pipeline where I continuously read from Kafka, run through
>> various transforms, and finally emit output to various sinks like Elastic
>> Search, BigQuery, GCS buckets, etc.
>>
>> There are a few transforms where I maintain state for input KV pairs
>> (using the stateful Beam model) and update the incoming data based on
>> metadata maintained in state.
>>
>> The use case is that each of those KV pairs belongs to some user
>> activity, and we don't know how long this activity will last. It can run
>> from a few hours to a few days. If in between Beam runs into some issue,
>> or we need to restart the Beam job for maintenance, we should be able to
>> retain the state.
>>
>> To accommodate the above requirement without maintaining a global
>> window, I maintain a fixed window of 24 hours, and at window expiry I do
>> checkpointing by writing the state into a BigQuery table. In a new
>> window, if the code doesn't find state for a key, it does a BigQuery
>> read and reloads the state so that state is again maintained for the new
>> window.
>>
>> My above implementation works fine as long as we have few keys to
>> handle. Heavy user traffic results in a lot of KV pairs, which sometimes
>> leads to many parallel BigQuery reads of state at the start of a new
>> window. Since our system also has other modules, cloud functions, etc.
>> that keep reading from BigQuery, under heavy load Dataflow sometimes
>> receives the following exception while reading state/checkpoints from
>> BigQuery:
>>
>> exception: "com.google.cloud.bigquery.BigQueryException: Job exceeded
>> rate limits: Your project_and_region exceeded
>>
>> We have asked GCP folks to increase the limit, but according to them
>> that is not good for performance and also increases cost.
>>
>> My question is:
>> *Is my above approach of checkpointing with BigQuery correct? Can
>> someone suggest a better approach for checkpointing in the case of
>> Dataflow?*
>>
>> Thanks and Regards
>> Mohil
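P.S. As discussed above, the checkpointing is essentially a simple KV put (at window expiry) and get (at the start of a new window). A minimal self-contained sketch of that abstraction, with all names hypothetical and an in-memory map standing in for the real BigQuery (or Bigtable) backing store:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the checkpoint-store abstraction discussed in this
// thread: flushing state at window expiry is a put(), reloading state when a
// new window sees an unknown key is a get(). The real implementation writes
// to BigQuery; the same interface could be backed by Cloud Bigtable instead.
interface CheckpointStore<K, V> {
  void put(K key, V state);   // persist state when the fixed window expires
  Optional<V> get(K key);     // reload state on the first element of a new window
}

// In-memory stand-in, used here only to keep the sketch self-contained.
class InMemoryCheckpointStore<K, V> implements CheckpointStore<K, V> {
  private final Map<K, V> backing = new HashMap<>();

  @Override
  public void put(K key, V state) {
    backing.put(key, state);
  }

  @Override
  public Optional<V> get(K key) {
    return Optional.ofNullable(backing.get(key));
  }
}

public class CheckpointSketch {
  public static void main(String[] args) {
    CheckpointStore<String, Long> store = new InMemoryCheckpointStore<>();
    store.put("user-activity-42", 128L);             // flush at window expiry
    System.out.println(store.get("user-activity-42") // reload in the new window
        .orElse(-1L));
    System.out.println(store.get("unknown-key").isPresent());
  }
}
```

Because the interface is this narrow, swapping the backing store (as Luke suggests) only changes the implementation of put/get, not the pipeline code around it.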
