Re: deduplication with streaming sql

2018-02-06 Thread Henri Heiskanen
Hi, Oh, right.. got it. Thanks! Br, Henkka On Tue, Feb 6, 2018 at 5:01 PM, Fabian Hueske wrote: > Hi Henkka, > > This should be fairly easy to implement in a ProcessFunction. > You're making a good call to worry about the number of timers. If you > register a timer multiple times on the same t

Re: deduplication with streaming sql

2018-02-06 Thread Fabian Hueske
Hi Henkka, This should be fairly easy to implement in a ProcessFunction. You're making a good call to worry about the number of timers. If you register a timer multiple times on the same time, the timer is deduplicated ;-) and will only fire once for that time. That's why the state retention time

Re: deduplication with streaming sql

2018-02-06 Thread Henri Heiskanen
Hi, Thanks. Doing this deduplication would be easy just by using vanilla flink api and state (check if this is a new key and then emit), but the issue has been automatic state cleanup. However, it looks like this streaming sql retention time implementation uses the process function and timer. I w

Re: deduplication with streaming sql

2018-02-06 Thread Fabian Hueske
Hi Henri, thanks for reaching out and providing code and data to reproduce the issue. I think you are right, a "SELECT DISTINCT a, b, c FROM X" should not result in a retraction stream. However, with the current implementation we internally need a retraction stream if a state retention time is

Re: deduplication with streaming sql

2018-02-06 Thread Timo Walther
Hi Henri, I just noticed that I had a tiny mistake in my little test program. So SELECT DISTINCT is officially supported. But the question if this is a valid append stream is still up for discussion. I will loop in Fabian (in CC). For the general behavior you can also look into the code and

Re: deduplication with streaming sql

2018-02-06 Thread Timo Walther
Hi Henri, I try to answer your question: 1) You are right, SELECT DISTINCT should not need a retract stream. Internally, this is translated into an aggregation without an aggregate function call. So this definitely needs improvement. 2) The problem is that SELECT DISTINCT is not officially s