Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

Felipe Gutierrez Wed, 12 Jun 2019 06:20:43 -0700

Hi Rong, I implemented my solution using a ProcessingWindow with timeWindow
and a ReduceFunction with timeWindowAll [1]. So for the first window I use
parallelism and the second window I use to merge everything on the Reducer.
I guess it is the best approach to do DistinctCount. Would you suggest some
improvements?


[1]
https://github.com/felipegutierrez/explore-flink/blob/master/src/main/java/org/sense/flink/examples/stream/WordDistinctCountProcessTimeWindowSocket.java

Thanks!
*--*
*-- Felipe Gutierrez*

*-- skype: felipe.o.gutierrez*
*--* *https://felipeogutierrez.blogspot.com
<https://felipeogutierrez.blogspot.com>*


On Wed, Jun 12, 2019 at 9:27 AM Felipe Gutierrez <
felipe.o.gutier...@gmail.com> wrote:

> Hi Rong,
>
> thanks for your answer. If I understood well, the option will be to use
> ProcessFunction [1] since it has the method onTimer(). But not the
> ProcessWindowFunction [2], because it does not have the method onTimer(). I
> will need this method to call Collector<OUT> out.collect(...) from the
> onTImer() method in order to emit a single value of my Distinct Count
> function.
>
> Is that reasonable what I am saying?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/datastream/DataStream.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/functions/windowing/ProcessWindowFunction.html
>
> Kind Regards,
> Felipe
>
> *--*
> *-- Felipe Gutierrez*
>
> *-- skype: felipe.o.gutierrez*
> *--* *https://felipeogutierrez.blogspot.com
> <https://felipeogutierrez.blogspot.com>*
>
>
> On Wed, Jun 12, 2019 at 3:41 AM Rong Rong <walter...@gmail.com> wrote:
>
>> Hi Felipe,
>>
>> there are multiple ways to do DISTINCT COUNT in Table/SQL API. In fact
>> there's already a thread going on recently [1]
>> Based on the description you provided, it seems like it might be a better
>> API level to use.
>>
>> To answer your question,
>> - You should be able to use other TimeCharacteristic. You might want to
>> try WindowProcessFunction and see if this fits your use case.
>> - Not sure I fully understand the question, your keyed by should be done
>> on your distinct key (or a combo key) and if you do keyby correctly then
>> yes all msg with same key is processed by the same TM thread.
>>
>> --
>> Rong
>>
>>
>>
>> [1]
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/count-DISTINCT-in-flink-SQL-td28061.html
>>
>> On Tue, Jun 11, 2019 at 1:27 AM Felipe Gutierrez <
>> felipe.o.gutier...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have implemented a Flink data stream application to compute distinct
>>> count of words. Flink does not have a built-in operator which does this
>>> computation. I used KeyedProcessFunction and I am saving the state on a
>>> ValueState descriptor.
>>> Could someone check if my implementation is the best way of doing it?
>>> Here is my solution:
>>> https://stackoverflow.com/questions/56524962/how-can-i-improve-my-count-distinct-for-data-stream-implementation-in-flink/56539296#56539296
>>>
>>> I have some points that I could not understand better:
>>> - I only could use TimeCharacteristic.IngestionTime.
>>> - I split the words using "Tuple2<Integer, String>(0, word)", so I will
>>> have always the same key (0). As I understand, all the events will be
>>> processed on the same TaskManager which will not achieve parallelism if I
>>> am in a cluster.
>>>
>>> Kind Regards,
>>> Felipe
>>> *--*
>>> *-- Felipe Gutierrez*
>>>
>>> *-- skype: felipe.o.gutierrez*
>>> *--* *https://felipeogutierrez.blogspot.com
>>> <https://felipeogutierrez.blogspot.com>*
>>>
>>

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

Reply via email to