Re: deduplication with streaming sql

2018-02-06 Thread Henri Heiskanen
Hi, Oh, right.. got it. Thanks! Br, Henkka

Re: deduplication with streaming sql

2018-02-06 Thread Fabian Hueske
Hi Henkka, This should be fairly easy to implement in a ProcessFunction. You are right to worry about the number of timers. If you register a timer multiple times for the same timestamp, the timer is deduplicated ;-) and will only fire once for that time. That's why the state retention time
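
To illustrate the timer deduplication described above, here is a minimal toy model in plain Python. It is not the Flink API: `ToyTimerService`, `register`, `advance_to` and `pending` are hypothetical names made up for this sketch. It only shows the idea that registering the same (key, timestamp) pair several times leaves a single pending timer that fires once.

```python
import heapq


class ToyTimerService:
    """Toy model of per-key timers. Hypothetical, not the Flink API."""

    def __init__(self):
        self._timers = set()  # (timestamp, key) pairs; the set deduplicates
        self._heap = []       # same pairs, ordered by timestamp for firing

    def register(self, key, timestamp):
        """Register a timer; a repeated (timestamp, key) pair is a no-op."""
        entry = (timestamp, key)
        if entry not in self._timers:
            self._timers.add(entry)
            heapq.heappush(self._heap, entry)

    def advance_to(self, now):
        """Fire and return all timers with timestamp <= now, in order."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            entry = heapq.heappop(self._heap)
            self._timers.discard(entry)
            fired.append(entry)
        return fired

    def pending(self):
        """Number of timers that have not fired yet."""
        return len(self._timers)


timers = ToyTimerService()
for _ in range(3):
    timers.register("user-1", 1000)  # same key, same timestamp, three times
timers.register("user-1", 2000)      # a different timestamp is a new timer

fired = timers.advance_to(1500)      # only the single timer at 1000 is due
```

In this model the number of pending timers grows with the number of distinct (key, timestamp) pairs, not with the number of `register` calls.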

Re: deduplication with streaming sql

2018-02-06 Thread Henri Heiskanen
Hi, Thanks. Doing this deduplication would be easy just by using the vanilla Flink API and state (check if this is a new key and then emit), but the issue has been automatic state cleanup. However, it looks like this streaming SQL retention time implementation uses the process function and timer. I w
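
The "check if this is a new key and then emit" approach with timer-based cleanup could be sketched roughly as follows. This is a plain-Python simulation under stated assumptions, not Flink code: `ToyDeduplicator`, `process`, `advance_to` and `retention_ms` are hypothetical names, and the cleanup policy (drop a key once it has been idle for the retention time) is only one possible choice.

```python
class ToyDeduplicator:
    """Toy keyed deduplication with idle-state cleanup. Hypothetical model."""

    def __init__(self, retention_ms):
        self.retention_ms = retention_ms
        self.last_seen = {}  # key -> timestamp of the most recent row
        self.timers = set()  # (cleanup_time, key); the set deduplicates

    def process(self, key, timestamp):
        """Return True if the row should be emitted (key not in state)."""
        is_new = key not in self.last_seen
        self.last_seen[key] = timestamp
        self.timers.add((timestamp + self.retention_ms, key))
        return is_new

    def advance_to(self, now):
        """Fire due cleanup timers and drop keys idle for retention_ms."""
        due = sorted(t for t in self.timers if t[0] <= now)
        for cleanup_time, key in due:
            self.timers.discard((cleanup_time, key))
            seen = self.last_seen.get(key)
            # Only clean up if no newer row refreshed the key in the meantime.
            if seen is not None and seen + self.retention_ms == cleanup_time:
                del self.last_seen[key]


dedup = ToyDeduplicator(retention_ms=100)
events = [("a", 0), ("b", 10), ("a", 20), ("a", 200)]
emitted = []
for key, ts in events:
    dedup.advance_to(ts)  # let time progress before handling the row
    if dedup.process(key, ts):
        emitted.append((key, ts))
```

With these inputs the duplicate `("a", 20)` is suppressed, while `("a", 200)` is emitted again because the state for `"a"` was cleaned up after being idle for the retention time.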

Re: deduplication with streaming sql

2018-02-06 Thread Fabian Hueske
Hi Henri, thanks for reaching out and providing code and data to reproduce the issue. I think you are right, a "SELECT DISTINCT a, b, c FROM X" should not result in a retraction stream. However, with the current implementation we internally need a retraction stream if a state retention time is

Re: deduplication with streaming sql

2018-02-06 Thread Timo Walther
Hi Henri, I just noticed that I had a tiny mistake in my little test program. So SELECT DISTINCT is officially supported. But the question of whether this is a valid append stream is still up for discussion. I will loop in Fabian (in CC). For the general behavior you can also look into the code and

Re: deduplication with streaming sql

2018-02-06 Thread Timo Walther
Hi Henri, I'll try to answer your questions: 1) You are right, SELECT DISTINCT should not need a retract stream. Internally, this is translated into an aggregation without an aggregate function call. So this definitely needs improvement. 2) The problem is that SELECT DISTINCT is not officially s

deduplication with streaming sql

2018-02-06 Thread Henri Heiskanen
Hi, I have a use case where I would like to find distinct rows over a certain period of time. The requirement is that a new row is emitted as soon as possible. Otherwise, the requirement is mainly to filter out data to have a smaller dataset for downstream. I noticed that SELECT DISTINCT and state retention time of 12