Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Felipe Gutierrez
humm.. it seems that it is my turn to implement all this stuff using Table API. Thanks Rong! *--* *-- Felipe Gutierrez* *-- skype: felipe.o.gutierrez* *--* *https://felipeogutierrez.blogspot.com * On Thu, Jun 13, 2019 at 6:00 PM Rong Rong wrote: > Hi

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Rong Rong
Hi Felipe, Hequn is right. The problem you are facing is better using TableAPI level code instead of dealing with in DataStream. You will have more Flink library support to achieve your goal. In addition, Flink TableAPI also support UserDefineAggregateFunction [1] to achieve your hyperLogLog

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Felipe Gutierrez
Hi Hequn, indeed the ReduceFunction is better than the ProcessWindowFunction. I replaced and could check the improvement performance [1]. Thanks for that! I will try a distinct count with the Table API. The question that I am facing is that I want to use a HyperLogLog on a UDF for DataStream.

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Hequn Cheng
Hi Felipe, >From your code, I think you want to get the "count distinct" result instead of the "distinct count". They contain a different meaning. To improve the performance, you can replace your DistinctProcessWindowFunction to a DistinctProcessReduceFunction. A ReduceFunction can aggregate the

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-12 Thread Felipe Gutierrez
Hi Rong, I implemented my solution using a ProcessingWindow with timeWindow and a ReduceFunction with timeWindowAll [1]. So for the first window I use parallelism and the second window I use to merge everything on the Reducer. I guess it is the best approach to do DistinctCount. Would you suggest

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-12 Thread Felipe Gutierrez
Hi Rong, thanks for your answer. If I understood well, the option will be to use ProcessFunction [1] since it has the method onTimer(). But not the ProcessWindowFunction [2], because it does not have the method onTimer(). I will need this method to call Collector out.collect(...) from the

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-11 Thread Rong Rong
Hi Felipe, there are multiple ways to do DISTINCT COUNT in Table/SQL API. In fact there's already a thread going on recently [1] Based on the description you provided, it seems like it might be a better API level to use. To answer your question, - You should be able to use other

How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-11 Thread Felipe Gutierrez
Hi all, I have implemented a Flink data stream application to compute distinct count of words. Flink does not have a built-in operator which does this computation. I used KeyedProcessFunction and I am saving the state on a ValueState descriptor. Could someone check if my implementation is the