Re: Flink: For terabytes of keyed state.

Congxian Qiu Tue, 05 May 2020 20:05:43 -0700

Hi

>From my experience, you should care the state size for a single task(not
the whole job state size), the download speed for single thread is almost
100 MB/s (this may vary in different env), and I do not have much
performance for loading state into RocksDB(we use an internal KV store in
my company), but loading state into RocksDB will not slower than
downloading from my experience.


Best,
Congxian


Gowri Sundaram <gowripsunda...@gmail.com> 于2020年5月3日周日 下午11:25写道：

> Hi Congxian,
> Thank you so much for your response, that really helps!
>
> From your experience, how long does it take for Flink to redistribute
> terabytes of state data on node addition / node failure.
>
> Thanks!
>
> On Sun, May 3, 2020 at 6:56 PM Congxian Qiu <qcx978132...@gmail.com>
> wrote:
>
>> Hi
>>
>> 1. From my experience, Flink can support such big state, you can set
>> appropriate parallelism for the stateful operator. for RocksDB you may need
>> to care about the disk performance.
>> 2. Inside Flink, the state is separated by key-group, each
>> task/parallelism contains multiple key-groups.  Flink does not need to
>> restart when you add a node to the cluster, but every time restart from
>> savepoint/checkpoint(or failover), Flink needs to redistribute the
>> checkpoint data, this can be omitted if it's failover and local recovery[1]
>> is enabled
>> 3. for upload/download state, you can ref to the multiple thread
>> upload/download[2][3] for speed up them
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/large_state_tuning.html#task-local-recovery
>> [2] https://issues.apache.org/jira/browse/FLINK-10461
>> [3] https://issues.apache.org/jira/browse/FLINK-11008
>>
>> Best,
>> Congxian
>>
>>
>> Gowri Sundaram <gowripsunda...@gmail.com> 于2020年5月1日周五 下午6:29写道：
>>
>>> Hello all,
>>> We have read in multiple
>>> <https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html>
>>> sources <https://flink.apache.org/usecases.html> that Flink has been
>>> used for use cases with terabytes of application state.
>>>
>>> We are considering using Flink for a similar use case with* keyed state
>>> in the range of 20 to 30 TB*. We had a few questions regarding the same.
>>>
>>>
>>>    - *Is Flink a good option for this kind of scale of data* ? We are
>>>    considering using RocksDB as the state backend.
>>>    - *What happens when we want to add a node to the cluster *?
>>>       - As per our understanding, if we have 10 nodes in our cluster,
>>>       with 20TB of state, this means that adding a node would require the 
>>> entire
>>>       20TB of data to be shipped again from the external checkpoint remote
>>>       storage to the taskmanager nodes.
>>>       - Assuming 1Gb/s network speed, and assuming all nodes can read
>>>       their respective 2TB state parallely, this would mean a *minimum
>>>       downtime of half an hour*. And this is assuming the throughput of
>>>       the remote storage does not become the bottleneck.
>>>       - Is there any way to reduce this estimated downtime ?
>>>
>>>
>>> Thank you!
>>>
>>

Re: Flink: For terabytes of keyed state.

Reply via email to