Hello,

Thank you all for your responses — I now have a much clearer understanding of 
how state works in Apache Beam. The data we currently store in Bigtable is 
critical, and we want to ensure it is never lost.

Duplicates are not an issue for us, as we always perform idempotent updates to 
unique keys in Bigtable. That said, we are exploring the use of Beam state for 
performance optimization. However, updating the pipeline job isn’t always 
feasible — particularly in scenarios where the pipeline becomes stuck and needs 
to be canceled, or when major changes to the pipeline graph may break state 
compatibility.
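
To make the idempotency point concrete, here is a minimal sketch of why replayed writes are harmless for us. It is not our production code and it does not use the Bigtable client API: `KeyedStore`, `replayPuts`, `replayIncrements`, and the key names are hypothetical stand-ins, and the "replay" loop is only an assumed model of a runner retrying a bundle.

```java
import java.util.HashMap;
import java.util.Map;

final class IdempotentWriteSketch {

    /** Hypothetical in-memory stand-in for a keyed table; NOT the Bigtable client API. */
    static final class KeyedStore {
        private final Map<String, Long> rows = new HashMap<>();

        /** Idempotent: sets the row to an absolute value, so a replay changes nothing. */
        void put(String rowKey, long value) {
            rows.put(rowKey, value);
        }

        /** Not idempotent: every replay adds the delta again. */
        void increment(String rowKey, long delta) {
            rows.merge(rowKey, delta, Long::sum);
        }

        /** Returns the stored value, or 0 if the row does not exist. */
        long get(String rowKey) {
            return rows.getOrDefault(rowKey, 0L);
        }
    }

    /** Assumed model of a retried bundle: the same absolute writes arrive `attempts` times. */
    static void replayPuts(KeyedStore store, Map<String, Long> bundle, int attempts) {
        for (int i = 0; i < attempts; i++) {
            bundle.forEach((key, value) -> store.put(key, value));
        }
    }

    /** Same retried bundle, but applied as increments instead of absolute writes. */
    static void replayIncrements(KeyedStore store, Map<String, Long> bundle, int attempts) {
        for (int i = 0; i < attempts; i++) {
            bundle.forEach((key, delta) -> store.increment(key, delta));
        }
    }

    public static void main(String[] args) {
        Map<String, Long> bundle = Map.of("key-a", 5L, "key-b", 7L);

        KeyedStore absolute = new KeyedStore();
        replayPuts(absolute, bundle, 3);

        KeyedStore additive = new KeyedStore();
        replayIncrements(additive, bundle, 3);

        System.out.println("absolute key-a = " + absolute.get("key-a"));
        System.out.println("additive key-a = " + additive.get("key-a"));
    }
}
```

With absolute writes, the stored value is the same after one attempt or several; with increments, each retry inflates it. Our updates are of the first kind.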

Given these constraints, I’d like to better understand the common or 
recommended best practices in such situations. For example:

1. Are there deployment strategies that help prevent or minimize state loss?

2. Is it possible to flush or back up state data externally in case the 
pipeline becomes unresponsive?

3. Are there alternative external state management approaches that may be 
better suited to this situation?
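
As one possible reading of question 3, the sketch below shows a write-through cache: every write goes to the durable store first and is then cached, so dropping the cache (for example, when a job is canceled) loses no data. This is an assumption about what such an approach could look like, not a recommendation taken from this thread. `DurableStore`, `InMemoryStore`, and `WriteThroughCache` are hypothetical names; in a real pipeline the cache role might be played by Beam state and the store by Bigtable, but neither API is used here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class WriteThroughSketch {

    /** Hypothetical durable store interface (Bigtable would play this role). */
    interface DurableStore {
        Optional<String> read(String key);
        void write(String key, String value);
    }

    /** In-memory stand-in that counts reads so cache hits can be observed. */
    static final class InMemoryStore implements DurableStore {
        private final Map<String, String> rows = new HashMap<>();
        int reads = 0;

        @Override
        public Optional<String> read(String key) {
            reads++;
            return Optional.ofNullable(rows.get(key));
        }

        @Override
        public void write(String key, String value) {
            rows.put(key, value);
        }
    }

    /** Cache that is only an optimization: the durable store stays the source of truth. */
    static final class WriteThroughCache {
        private final DurableStore store;
        private final Map<String, String> cache = new HashMap<>();

        WriteThroughCache(DurableStore store) {
            this.store = store;
        }

        /** Writes to the durable store first, then updates the cache. */
        void put(String key, String value) {
            store.write(key, value);
            cache.put(key, value);
        }

        /** Serves from the cache when possible; otherwise reloads from the durable store. */
        Optional<String> get(String key) {
            String cached = cache.get(key);
            if (cached != null) {
                return Optional.of(cached);
            }
            Optional<String> loaded = store.read(key);
            loaded.ifPresent(value -> cache.put(key, value));
            return loaded;
        }

        /** Models losing the local state, e.g. after canceling and restarting a job. */
        void dropCache() {
            cache.clear();
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        WriteThroughCache cache = new WriteThroughCache(store);

        cache.put("key-a", "v1");
        cache.dropCache();
        System.out.println("after cache loss: " + cache.get("key-a").orElse("<missing>"));
        System.out.println("durable reads: " + store.reads);
    }
}
```

The trade-off in this sketch is that writes are no faster than before; only repeated reads of the same key avoid a round trip.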

Thank you in advance.

Best regards.


> On 29 Apr 2025, at 23:11, Reuven Lax via user <user@beam.apache.org> wrote:
> 
> Pipeline state persists across pipeline updates, i.e. if you update the job 
> to a new one. If you cancel the job and restart, then you generally lose the 
> state.
> 
> Writing to an external store such as Bigtable from your DoFn can be tricky 
> from both a performance perspective and a correctness perspective. Beam 
> runners may retry bundles, and while internal state consistency is guaranteed 
> across retries, external state may not be. You might see duplicates or worse 
> in your Cloud Bigtable.
> 
> Reuven
> 
> On Mon, Apr 28, 2025 at 2:12 AM Shaochen Bai <shaoc...@kisi.io> wrote:
>> Hello,
>> 
>> Thank you for your response. I was not aware that state in Apache Beam 
>> persists across different jobs — there seem to be very few open resources 
>> discussing this. Here is one of the few I found.
>> 
>> I do have some concerns regarding state management:
>> 
>> 1. Does the state persist if the pipeline gets stuck and we have to cancel 
>> or force-cancel the job?
>> 
>> 2. Does the state persist if we modify the structure of the pipeline and use 
>> the state in a different DoFn?
>> 
>> 3. It appears that we need to specify the persistent disk size when 
>> deploying the pipeline. Since we may need to scale the disk size as the 
>> state grows, will all existing state persist correctly after scaling?
>> 
>> Since we do not have a clear understanding of the state persistence 
>> mechanism and its expected behavior, we are hesitant to adopt it fully. If 
>> you could point me to any public references or resources on this topic, I 
>> would greatly appreciate it.
>> 
>> Thank you again for your help.
>> 
>> Best regards,
>> 
>> 
>> 
>> Reference:
>> Dataflow - State persistence specs (stackoverflow.com)
>> https://stackoverflow.com/questions/69835743/dataflow-state-persistence-specs
>> 
>>> On 25 Apr 2025, at 17:12, XQ Hu via user <user@beam.apache.org> wrote:
>>> 
>>> Apache Beam provides a built-in mechanism specifically for managing 
>>> per-key-and-window state that persists across workers and pipeline 
>>> restarts. Is there any reason you cannot use 
>>> https://beam.apache.org/documentation/programming-guide/#state-and-timers?
>>> 
>>> On Fri, Apr 25, 2025 at 8:45 AM Shaochen Bai <shaoc...@kisi.io> wrote:
>>>> Hi all,
>>>> 
>>>> I’m working on an online Apache Beam streaming pipeline where I need to 
>>>> store, read, and modify values across different windowed data — including 
>>>> across pipeline restarts.
>>>> 
>>>> To handle this, I’m currently using Google Cloud Bigtable as my persistent 
>>>> storage backend. In my implementation:
>>>> 
>>>> 1. I initialize a BigtableDataClient in the @Setup method of a DoFn.
>>>> 
>>>> 2. I use this client within processElement to read from and write to 
>>>> Bigtable.
>>>> 
>>>> However, I’ve noticed that this setup may lead to increased thread and 
>>>> memory usage, especially when many DoFn instances are created in parallel.
>>>> 
>>>> I’d really appreciate your input on a few questions:
>>>> 
>>>> 1. Is using an external store like Bigtable the recommended approach to 
>>>> persist state across windows (and restarts)?
>>>> 
>>>> 2. Are there optimizations or best practices for managing Bigtable 
>>>> connections efficiently in this context, e.g., connection pooling, 
>>>> limiting client creation, or Beam-native alternatives for external state?
>>>> 
>>>> Any advice would be greatly appreciated.
>>>> 
>>>> Thanks in advance!
>>>> 
>>>> 
>>>> ---
>>>> This email is confidential/privileged. If you're not the intended 
>>>> recipient, please delete it and notify us immediately; please do not 
>>>> copy/use/disclose it for any purpose, to anyone. Thank you!
>> 
>> 


