Hi Rong,

I implemented my solution using a ProcessWindowFunction with timeWindow, followed by a ReduceFunction with timeWindowAll [1]. The first window runs in parallel, and the second window merges the partial results in a single reducer. As far as I can tell, this is the best approach for computing a distinct count. Would you suggest any improvements?
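To make the two-stage idea concrete, here is a minimal sketch in plain Java. It does not use the Flink API, and it is not the code from the linked repository. The class and method names (TwoStageDistinctCount, partitionByKey, mergeCounts) are made up for illustration. It assumes the first stage is keyed by the word itself, so the partial sets are disjoint and the second stage only needs to sum their sizes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a two-stage distinct count for one window of words.
class TwoStageDistinctCount {

    // Stage 1 (parallel in a real job): route each word to a partition by its hash,
    // so every occurrence of the same word lands in the same partition.
    static List<Set<String>> partitionByKey(List<String> words, int parallelism) {
        List<Set<String>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new HashSet<>());
        }
        for (String word : words) {
            int index = Math.floorMod(word.hashCode(), parallelism);
            partitions.get(index).add(word); // the set drops local duplicates
        }
        return partitions;
    }

    // Stage 2 (single reducer): the partitions hold disjoint words, so summing
    // the partial counts gives the global distinct count.
    static int mergeCounts(List<Set<String>> partitions) {
        int total = 0;
        for (Set<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    static int distinctCount(List<String> words, int parallelism) {
        return mergeCounts(partitionByKey(words, parallelism));
    }

    public static void main(String[] args) {
        List<String> window = Arrays.asList("flink", "count", "flink", "word", "count");
        System.out.println(distinctCount(window, 4)); // prints 3
    }
}
```

If the first stage used a constant key instead, all words would fall into one partition and the parallel stage would do no useful splitting, which is the concern raised earlier in this thread.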
[1] https://github.com/felipegutierrez/explore-flink/blob/master/src/main/java/org/sense/flink/examples/stream/WordDistinctCountProcessTimeWindowSocket.java

Thanks!
--
Felipe Gutierrez
skype: felipe.o.gutierrez
https://felipeogutierrez.blogspot.com

On Wed, Jun 12, 2019 at 9:27 AM Felipe Gutierrez <felipe.o.gutier...@gmail.com> wrote:

> Hi Rong,
>
> thanks for your answer. If I understood correctly, the option is to use
> ProcessFunction [1], since it has the method onTimer(), but not
> ProcessWindowFunction [2], because it does not have onTimer(). I
> need this method to call Collector<OUT> out.collect(...) from
> onTimer() in order to emit a single value from my distinct count
> function.
>
> Does that sound reasonable?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/datastream/DataStream.html
> [2] https://ci.apache.org/projects/flink/flink-docs-master/api/java/index.html?org/apache/flink/streaming/api/functions/windowing/ProcessWindowFunction.html
>
> Kind Regards,
> Felipe
>
> On Wed, Jun 12, 2019 at 3:41 AM Rong Rong <walter...@gmail.com> wrote:
>
>> Hi Felipe,
>>
>> there are multiple ways to do DISTINCT COUNT in the Table/SQL API. In fact,
>> there is a recent thread on this topic [1].
>> Based on the description you provided, it seems like it might be a better
>> API level to use.
>>
>> To answer your questions:
>> - You should be able to use other TimeCharacteristic values. You might want to
>> try ProcessWindowFunction and see if it fits your use case.
>> - Not sure I fully understand the question. Your keyBy should be done
>> on your distinct key (or a combined key), and if you do the keyBy correctly, then
>> yes, all messages with the same key are processed by the same TM thread.
>>
>> --
>> Rong
>>
>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/count-DISTINCT-in-flink-SQL-td28061.html
>>
>> On Tue, Jun 11, 2019 at 1:27 AM Felipe Gutierrez <felipe.o.gutier...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have implemented a Flink data stream application to compute the distinct
>>> count of words. Flink does not have a built-in operator that does this
>>> computation. I used KeyedProcessFunction and I am saving the state in a
>>> ValueState.
>>> Could someone check whether my implementation is the best way of doing it?
>>> Here is my solution:
>>> https://stackoverflow.com/questions/56524962/how-can-i-improve-my-count-distinct-for-data-stream-implementation-in-flink/56539296#56539296
>>>
>>> There are some points that I could not fully understand:
>>> - I could only use TimeCharacteristic.IngestionTime.
>>> - I split the words using "Tuple2<Integer, String>(0, word)", so I will
>>> always have the same key (0). As I understand it, all the events will be
>>> processed on the same TaskManager, which will not achieve parallelism if I
>>> am on a cluster.
>>>
>>> Kind Regards,
>>> Felipe