[jira] [Created] (FLINK-8601) Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization

Sihua Zhou (JIRA) Wed, 07 Feb 2018 23:31:53 -0800

Sihua Zhou created FLINK-8601:
---------------------------------

             Summary: Introduce LinkedBloomFilterState for Approximate 
calculation and other situations of performance optimization
                 Key: FLINK-8601
                 URL: https://issues.apache.org/jira/browse/FLINK-8601
             Project: Flink
          Issue Type: New Feature
          Components: Core, DataStream API
    Affects Versions: 1.4.0
            Reporter: Sihua Zhou
            Assignee: Sihua Zhou

h3. Backgroud

Bloom filter is useful in many situation, for example:
* 1. Approximate calculation: deduplication (eg: UV calculation)
* 2. Performance optimization: eg, [runtime filter
join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]

However, based on the current status provided by flink, it is hard to use the
bloom filter for the following reasons:
* 1. Serialization problem: Bloom filter status can be large (for example:
100M), if implement it based on the RocksDB state, the state data will need to
be serialized each time it is queried and updated, and the performance will be
very poor.
* 2. Data skewed: Data in different key group can be skewed, and the
information of data skewed can not be accurately predicted before the program
is running. Therefore, it is impossible to determine how much resources bloom
filter should allocate. One way to do this is to allocate space needed for the
most skewed case, but this can lead to very serious waste of resources.

h3. Requirement

Therefore, I introduce the LinkedBloomFilterState for flink, which at least
need to meet the following features:
* 1. Support for changing Parallelism
* 2. Only serialize when necessary: when performing checkpoint
* 3. Can deal with data skew problem: users only need to specify a
LinkedBloomFilterState with the desired input, fpp, system will allocate
resource dynamic.
* 4. Do not conflict with other state: user can use KeyedState and
OperateState when using bloom filter state.
* 5. Support relax ttl (ie: the data survival time at least greater than the
specified time)

Design doc: to be soon

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (FLINK-8601) Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization

Reply via email to