Hi, at my current company we have multiple Beam pipelines developed with the Python SDK, and we use Dataflow as the runner. In one specific pipeline we use state extensively to store information about the last event seen for each key. Since, with our data, we can end up with quite a lot of different keys, we were wondering whether there are limitations on the amount of data we can keep in state. The pipelines run in streaming mode and use the Streaming Engine.

I did some research on this before writing this email, but I wasn't able to get a good understanding of how state is implemented under the hood and what guarantees it provides. I found this article <https://flink.apache.org/2021/01/18/rocksdb.html> that explains how state is stored for the Flink runner, and since Beam can run on Flink, I assume the approach might be similar in Dataflow, but I couldn't find any docs describing it.

We do have garbage-collection logic inside the pipeline that clears keys which have had no activity for some time; it currently fires every 10 minutes, but we would like to extend that to maybe 24 hours. We are not sure whether that is reasonable because of the unknowns around how state is stored.
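To make the pattern concrete, here is a simplified sketch of the kind of per-key state plus timer-based GC I am describing (the class name, coder, and TTL value are illustrative, not our actual code): the last event for each key is held in a ReadModifyWriteStateSpec, and an event-time timer clears the state for keys with no recent activity.

import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)
from apache_beam.utils.timestamp import Duration


class LastEventPerKeyFn(beam.DoFn):
  """Keeps the last event per key and clears it after a period of inactivity."""

  # Assumes the event payload is a string; our real coder differs.
  LAST_EVENT = ReadModifyWriteStateSpec('last_event', StrUtf8Coder())
  EXPIRY_TIMER = TimerSpec('expiry', TimeDomain.WATERMARK)

  # Illustrative TTL: today we clear after ~10 minutes, the goal is ~24 hours.
  TTL = Duration(seconds=10 * 60)

  def process(self,
              element,  # a (key, value) pair; state is partitioned by key
              last_event=beam.DoFn.StateParam(LAST_EVENT),
              expiry=beam.DoFn.TimerParam(EXPIRY_TIMER),
              timestamp=beam.DoFn.TimestampParam):
    key, value = element
    last_event.write(value)           # overwrite state with the newest event
    expiry.set(timestamp + self.TTL)  # push the GC deadline forward
    yield element

  @on_timer(EXPIRY_TIMER)
  def on_expiry(self, last_event=beam.DoFn.StateParam(LAST_EVENT)):
    last_event.clear()                # drop state for keys with no activity

In the real pipeline this DoFn is applied to a keyed PCollection, e.g. events | beam.Map(lambda e: (e.key, e.payload)) | beam.ParDo(LastEventPerKeyFn()).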
Can you please help with the following questions about how state is stored?

1. I assume it is a key-value store, but what kind of service backs it? Are we able to access it directly?
2. What guarantees does it provide? How is state carried over to a new, updated pipeline? Can we reuse the state when we deploy a new version of the pipeline without using the pipeline update feature?
3. Are there any resources where I can dig deeper into the implementation of state?

Regards,
Tudor
