This may be possible with a custom WindowFn. Where is the configuration of
what aggregations to do coming from?
On Wed, Sep 21, 2016 at 11:27 PM, Chawla,Sumit
wrote:
> Attaching the Image.
>
>
>
>
> Regards
> Sumit Chawla
>
>
> On Wed, Sep 21, 2016 at 11:24 PM, Chawla,Sumit
> wrote:
>
>> Hi All
>>
>> I am trying to code a solution for following scenarios.
>>
>> 1. I have a stream of Tuples with multiple numeric fields (e.g. A, B, C,
>> D, E ... etc )
>> 2. I want the ability to do different Windowing and Aggregation on each
>> field or a group of fields in the Tuple. e.g. Sum A over a Period of 2
>> minutes, Avg B over a period of 3 minutes, Sum of C grouped by D over a
>> period of 15 minutes
>> 3. *These window requirements can be added by user at runtime*. My
>> pipeline should be able to compute a new aggregation at runtime.
>> 4. Plan to support only simple aggregation windows like SUM, AVG, MAX,
>> MIN, COUNT etc.
>>
>>
>> As i understand in BEAM pipelines ( with Flink Runner), the DAG of
>> computations cannot be altered once the pipeline is deployed. I am trying
>> to see how can i support above use case. I would love to hear your
>> feedback on this, and suggestions on doing it in a completely different
>> way.
>>
>> *My Design:*
>>
>> 1. Create 1 minute buckets per Field or Group of Fields and compute
>> basic aggregations for bucket. e.g. Buckets are highlighted in Yellow
>> here. For each field i calculate [SUM, COUNT, MAX, MIN] in the bucket. (
>> Bucket size of 1 minute is defined for simplicity, and would limit the
>> minimum window size to 1 minute)
>>
>> 2. Downstream process these buckets, and compute the user defined
>> aggregations. Following diagram depicts Tumbling Window computations. The
>> Aggregation functions in GREEN are just NATIVE functions consuming
>> different buckets, and doing aggregations on top of these buckets.
>>
>>
>>
>>
>>
>>
>>
>> *P.S.*
>>
>> * Some of the design choices that i have decided not to go for are:*
>>
>> 1. Multiple Pipelines for doing computation. One master pipeline does
>> grouping, and sends to a different topic based on user configured window
>> size. (e.g. topic_window_by_5_min, topic_window_by_15_min), and have a
>> different pipeline consume each topic.
>>
>> 2. Single pipeline doing all the business with predefined Windows
>> defined for Downstream processing. e.g. 5, 15, 30, 60 minute windows will
>> be defined which will consume from different Side Inputs. User is only
>> allowed only to select these Window sizes. Upstream Group By operator will
>> route to the data to different Window Function based on user configuration.
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>