Hi Bowen,

There is no built-in TTL, but you can use a ProcessFunction to register a timer that clears the state.
ProcessFunction docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/process_function.html

Best,
Aljoscha

> On 27. Aug 2017, at 19:19, Bowen Li <bowen...@offerupnow.com> wrote:
>
> Hi Robert,
> Thank you for the suggestion, I'll try that.
>
> On second thought, I can actually reduce the amount of generated output
> so there aren't that many records being sent to Kinesis.
>
> What I want to do is use Flink's state to keep track of the last
> computation result of a window for each key. If the latest computation
> result is the same as the previous one, my Flink job shouldn't emit a new
> record. However, that requires some expiration functionality so that the
> state won't grow indefinitely, as explained in
> https://issues.apache.org/jira/browse/FLINK-3089. Is there any way to
> expire keyed state by time?
>
> Thanks,
> Bowen
>
> On Sun, Aug 27, 2017 at 5:41 AM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi Bowen,
>
> I don't know what kind of relationship your company has with AWS; maybe
> they are willing to look into the issue from their side.
>
> To throttle a stream, I would recommend a map operation that calls
> "Thread.sleep(<ms>)" every n records.
>
> On Sat, Aug 26, 2017 at 4:11 AM, Bowen Li <bowen...@offerupnow.com> wrote:
>
> Hi Robert,
>
> We use the Kinesis sink (FlinkKinesisProducer). The main pain point is the
> Kinesis Producer Library (KPL) that FlinkKinesisProducer uses.
>
> KPL is basically a Java wrapper around a C++ core. It's slow, unstable,
> easy to crash, and memory- and CPU-hungry (it sends traffic via HTTP), and
> it can't handle a high workload such as a few million records in a short
> period of time. That said, there are no other options for writing to
> Kinesis besides KPL (the AWS Kinesis SDK is even slower), so I'm not
> blaming Flink for choosing KPL.
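[Editor's note] The timer-based expiry Aljoscha suggests, combined with the emit-only-on-change state Bowen describes, can be sketched outside Flink like this. It is a minimal sketch with hypothetical names: in a real job the logic would live in a ProcessFunction holding a per-key ValueState, registering a timer via ctx.timerService() in processElement() and calling state.clear() in onTimer(). Here, so the sketch runs standalone, keyed state is a HashMap and timers are a priority queue.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Standalone sketch: remember the last result per key, emit only when it
// changes, and expire idle keys after a TTL. All names are illustrative.
public class DedupWithTtl {
    private static final class Timer {
        final long ts;
        final String key;
        Timer(long ts, String key) { this.ts = ts; this.key = key; }
    }

    private final Map<String, String> lastResult = new HashMap<>();
    private final Map<String, Long> lastUpdate = new HashMap<>();
    private final PriorityQueue<Timer> timers =
            new PriorityQueue<>(Comparator.comparingLong((Timer t) -> t.ts));
    private final long ttlMs;

    public DedupWithTtl(long ttlMs) { this.ttlMs = ttlMs; }

    /** Analogous to processElement(): returns the record to emit, or null if unchanged. */
    public String process(long now, String key, String result) {
        expire(now); // fire any timers that are due before handling the element
        String previous = lastResult.put(key, result);
        lastUpdate.put(key, now);
        timers.add(new Timer(now + ttlMs, key));
        return result.equals(previous) ? null : result;
    }

    /** Analogous to onTimer(): clear state for keys idle for at least ttlMs. */
    private void expire(long now) {
        while (!timers.isEmpty() && timers.peek().ts <= now) {
            String key = timers.poll().key;
            Long updated = lastUpdate.get(key);
            // A stale timer may fire for a key that was updated since; only
            // clear if the key has really been idle for the full TTL.
            if (updated != null && updated + ttlMs <= now) {
                lastResult.remove(key);
                lastUpdate.remove(key);
            }
        }
    }
}
```

The stale-timer check mirrors the usual Flink pattern of comparing the firing timestamp against a last-modified timestamp kept in state before clearing.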
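[Editor's note] Robert's throttling suggestion can be sketched as a pass-through function that sleeps once every n records; in a Flink job the body of apply() would live in a MapFunction placed just before the Kinesis sink. Class, field, and parameter names are illustrative.

```java
import java.util.function.UnaryOperator;

// Sketch of the throttling idea from the thread: a pass-through mapper that
// calls Thread.sleep(...) once every n records, capping throughput at roughly
// everyN records per sleepMs (per parallel subtask).
public class ThrottlingMapper<T> implements UnaryOperator<T> {
    private final int everyN;    // sleep once per this many records
    private final long sleepMs;  // how long to pause
    private long seen = 0;

    public ThrottlingMapper(int everyN, long sleepMs) {
        this.everyN = everyN;
        this.sleepMs = sleepMs;
    }

    @Override
    public T apply(T record) {
        if (++seen % everyN == 0) {
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
        }
        return record; // records pass through unchanged
    }
}
```

Note that the effective cap scales with the operator's parallelism, since each parallel subtask throttles independently.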
>
> Are there any recommended ways to "artificially throttle down the stream
> before the sink"? How do I add that throttling to Flink's fluent API?
>
> Thanks,
> Bowen
>
> On Fri, Aug 25, 2017 at 2:31 PM, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi Bowen,
>
> (very nice graphics :) )
>
> I don't think you can do anything about the windows themselves (unless you
> are able to build the windows yourself using the ProcessFunction, playing
> some tricks because you know your data), so you should focus on reducing
> the pain of "burning down your sink".
> Does the sink have any issues with the spikes? (What's the downstream
> system?)
> Does it make sense for you to artificially throttle down the stream before
> the sink, so that the records per second are limited to a certain rate?
> Since you are using event time, the window results will always be correct
> and consistent. From a business perspective, this will of course introduce
> additional latency (= results come in later).
>
> On Fri, Aug 25, 2017 at 6:23 AM, Bowen Li <bowen...@offerupnow.com> wrote:
>
> Hi guys,
>
> I have a question about how Flink generates windows.
>
> We are using a 1-day sliding window with a 1-hour slide to count some
> features of items based on event time. We have about 20 million items. We
> observed that Flink only emits results at a fixed time each hour (e.g. 1am,
> 2am, 3am, or 1:15am, 2:15am, 3:15am with a 15-minute offset). That means
> 20 million windows/records are generated at the same moment every hour,
> which burns down our sink, while nothing is generated during the rest of
> that hour. The pattern looks like this:
>
>   # generated windows
>   |
>   |    /\         /\
>   |   /  \       /  \
>   |__/____\_____/____\__
>             time
>
> Is there any way to even out the number of generated windows/records
> within an hour? Can we have an evenly distributed load like this?
>
>   # generated windows
>   |
>   |
>   |  ----------------------
>   |________________________
>             time
>
> Thanks,
> Bowen
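[Editor's note] The spike Bowen describes follows from how sliding windows are aligned: boundaries are computed from the timestamp, the slide, and a single global offset, so every key's 1-day window ends on the same hour boundary and all results fire at once. The arithmetic can be sketched like this (it mirrors the start-with-offset formula in Flink's TimeWindow; the class and method names here are mine):

```java
// Why all 20M results arrive at once: sliding windows are aligned from a
// single global offset, so for a given slide every timestamp in the same
// hour maps to the same window boundaries, regardless of key.
public class WindowAlignment {
    /** Start of the most recent window containing `timestamp`, aligned to `offset`. */
    public static long lastWindowStart(long timestamp, long offset, long slide) {
        return timestamp - (timestamp - offset + slide) % slide;
    }
}
```

With slide = 1 hour and offset = 0, events at 1:00:05 and 1:59:59 both land in windows starting on the same hour boundary for every key; a non-zero offset merely shifts that boundary (e.g. to :15), it does not spread firings across the hour. Evening out the load would need custom per-key windowing (e.g. via ProcessFunction, as Robert notes) or throttling before the sink.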