Hi,
at my current company we have multiple Beam pipelines developed with the Python SDK, running on Dataflow. In one specific pipeline we use state extensively to store information about the last event seen for each key. Since, with our data, we can end up with quite a lot of distinct keys, we were wondering whether there are any limitations on the amount of data we can keep in state. The pipelines run in streaming mode and use the Streaming Engine. I did some research on this area before writing this email, but wasn't able to get a clear understanding of how state is implemented under the hood and what kind of guarantees it provides. I found this article <https://flink.apache.org/2021/01/18/rocksdb.html> that explains how state is stored for the Flink runner, and since Beam can also run on Flink, I assume the approach might be similar in Dataflow, but I couldn't find any docs describing this. We do have garbage collection logic implemented inside the pipeline that clears keys which have had no activity for some time; it currently runs every 10 minutes, but we would like to extend this to maybe 24 hours, and we are not sure whether that is safe given the unknowns about how state is stored.
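
For context, here is a simplified sketch of the pattern we use (names, coders, and the inactivity value are illustrative, not our actual code): a stateful DoFn that keeps the last event per key, plus a processing-time timer that clears the state for keys with no recent activity.

import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)
from apache_beam.utils.timestamp import Duration, Timestamp


class LastEventFn(beam.DoFn):
  # Keeps the last event per key; a timer clears idle keys (our "GC" logic).
  LAST_EVENT = ReadModifyWriteStateSpec('last_event', PickleCoder())
  GC_TIMER = TimerSpec('gc', TimeDomain.REAL_TIME)

  # Inactivity threshold: currently ~10 minutes, we'd like ~24 hours.
  INACTIVITY = Duration(seconds=10 * 60)

  def process(self,
              element,
              last_event=beam.DoFn.StateParam(LAST_EVENT),
              gc_timer=beam.DoFn.TimerParam(GC_TIMER)):
    key, event = element
    previous = last_event.read()   # last event stored for this key, if any
    last_event.write(event)        # overwrite with the new event
    # Push the GC timer forward on every event, so only idle keys expire.
    gc_timer.set(Timestamp.now() + self.INACTIVITY)
    yield key, (previous, event)

  @on_timer(GC_TIMER)
  def expire(self, last_event=beam.DoFn.StateParam(LAST_EVENT)):
    last_event.clear()             # drop the state for the idle key


# Applied to a keyed PCollection, e.g.:
#   events | beam.ParDo(LastEventFn())

The question is essentially how much data this per-key state can safely hold on Dataflow, and how it behaves if we let keys live much longer before clearing them.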

Could you please help with the following questions about how the state is
stored?
1. I assume it is a key-value store, but what kind of service is used? Are
we able to access it directly?
2. What kind of guarantees does it provide? How is the state carried over to a
new, updated pipeline? Can we reuse the state when we deploy a new version
of the pipeline without using the update feature?
3. Are there any resources where I can dig deeper into how the state is
implemented?

Regards,
Tudor