That's exactly what I found yesterday! Thank you, Aljoscha, for confirming it!

On Mon, Aug 28, 2017 at 2:57 AM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> Hi Bowen,
>
> There is no built-in TTL, but you can use a ProcessFunction to set a timer
> that clears state.
>
> ProcessFunction docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/process_function.html
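>
> A minimal sketch of that pattern (the class name, the Event type, and the
> 24-hour TTL are placeholders of mine, not from the docs):
>
> import org.apache.flink.api.common.state.ValueState;
> import org.apache.flink.api.common.state.ValueStateDescriptor;
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.streaming.api.functions.ProcessFunction;
> import org.apache.flink.util.Collector;
>
> public class TtlFunction extends ProcessFunction<Event, Event> {
>
>     private static final long TTL_MS = 24 * 60 * 60 * 1000L; // 24 hours
>
>     // event-time timestamp of the last update for the current key
>     private transient ValueState<Long> lastSeen;
>
>     @Override
>     public void open(Configuration parameters) {
>         lastSeen = getRuntimeContext().getState(
>                 new ValueStateDescriptor<>("last-seen", Long.class));
>     }
>
>     @Override
>     public void processElement(Event value, Context ctx, Collector<Event> out)
>             throws Exception {
>         lastSeen.update(ctx.timestamp());
>         // schedule a cleanup timer TTL_MS after the latest update for this key
>         ctx.timerService().registerEventTimeTimer(ctx.timestamp() + TTL_MS);
>         out.collect(value);
>     }
>
>     @Override
>     public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out)
>             throws Exception {
>         // a timer fires for every update, so only clear truly stale state
>         Long last = lastSeen.value();
>         if (last != null && timestamp >= last + TTL_MS) {
>             lastSeen.clear();
>         }
>     }
> }
>
> The function needs to run on a keyed stream, and the same onTimer hook can
> clear any other state you keep for the key.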
>
> Best,
> Aljoscha
>
> On 27. Aug 2017, at 19:19, Bowen Li <bowen...@offerupnow.com> wrote:
>
> Hi Robert,
>     Thank you for the suggestion; I'll try that.
>
>     On a second thought, I can actually reduce the amount of generated
> output so there aren't that many records being sent to Kinesis.
>
>     What I want to do is use Flink's state to keep track of the last
> computation result of a window for each key. If the latest computation
> result is the same as the previous one, my Flink job shouldn't emit a new
> record. However, that requires some expiration functionality so the state
> won't grow indefinitely, as explained in
> https://issues.apache.org/jira/browse/FLINK-3089. Is there any way to
> expire keyed state by time?
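>
>     Roughly, the operator I have in mind after the window looks like this
> (just a sketch; Result stands in for my window output type and the names
> are made up):
>
> import org.apache.flink.api.common.functions.RichFlatMapFunction;
> import org.apache.flink.api.common.state.ValueState;
> import org.apache.flink.api.common.state.ValueStateDescriptor;
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.util.Collector;
>
> public class EmitOnChange extends RichFlatMapFunction<Result, Result> {
>
>     // last result emitted for the current key (keyed state)
>     private transient ValueState<Result> lastEmitted;
>
>     @Override
>     public void open(Configuration parameters) {
>         lastEmitted = getRuntimeContext().getState(
>                 new ValueStateDescriptor<>("last-emitted", Result.class));
>     }
>
>     @Override
>     public void flatMap(Result current, Collector<Result> out) throws Exception {
>         // forward the window result only if it changed for this key
>         if (!current.equals(lastEmitted.value())) {
>             lastEmitted.update(current);
>             out.collect(current);
>         }
>     }
> }
>
>     Applied on a keyed stream, e.g.
> windowedResults.keyBy("itemId").flatMap(new EmitOnChange()), this
> suppresses duplicates, but without expiration the lastEmitted state lives
> forever.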
>
> Thanks,
> Bowen
>
>
>
> On Sun, Aug 27, 2017 at 5:41 AM, Robert Metzger <rmetz...@apache.org>
> wrote:
>
>> Hi Bowen,
>>
>> I don't know what kind of relationship your company has with AWS; maybe
>> they are willing to look into the issue from their side.
>>
>> To throttle a stream, I would recommend a map operation that calls
>> "Thread.sleep(<ms>)" every n records.
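>>
>> For example (just a sketch; the class name and the every-n/sleep values
>> are arbitrary):
>>
>> import org.apache.flink.api.common.functions.MapFunction;
>>
>> public class ThrottleMap<T> implements MapFunction<T, T> {
>>
>>     private final int everyN;
>>     private final long sleepMs;
>>     private long count = 0;
>>
>>     public ThrottleMap(int everyN, long sleepMs) {
>>         this.everyN = everyN;
>>         this.sleepMs = sleepMs;
>>     }
>>
>>     @Override
>>     public T map(T value) throws Exception {
>>         // crude rate limiting, applied per parallel instance
>>         if (++count % everyN == 0) {
>>             Thread.sleep(sleepMs);
>>         }
>>         return value;
>>     }
>> }
>>
>> It plugs into the fluent API right before the sink, e.g.
>> stream.map(new ThrottleMap<MyRecord>(1000, 50)).addSink(kinesisSink),
>> where MyRecord and kinesisSink are whatever you already have.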
>>
>> On Sat, Aug 26, 2017 at 4:11 AM, Bowen Li <bowen...@offerupnow.com>
>> wrote:
>>
>>> Hi Robert,
>>> We use the Kinesis sink (FlinkKinesisProducer). The main pain is the
>>> Kinesis Producer Library (KPL) that FlinkKinesisProducer uses.
>>>
>>> KPL is basically a Java wrapper around a C++ core. It's slow, unstable,
>>> easy to crash, and memory- and CPU-consuming (it sends traffic via HTTP),
>>> and it can't handle a high workload like a few million records in a short
>>> period of time. Well, in order to write to Kinesis there are no other
>>> options except KPL (the AWS Kinesis SDK is even slower), so I'm not
>>> blaming Flink for choosing KPL.
>>>
>>> Are there any recommended ways to "artificially throttle down the
>>> stream before the sink"? How would I add the throttling to Flink's
>>> fluent API?
>>>
>>> Thanks,
>>> Bowen
>>>
>>>
>>> On Fri, Aug 25, 2017 at 2:31 PM, Robert Metzger <rmetz...@apache.org>
>>> wrote:
>>>
>>>> Hi Bowen,
>>>>
>>>> (very nice graphics :) )
>>>>
>>>> I don't think you can do anything about the windows themselves (unless
>>>> you are able to build the windows yourself using the ProcessFunction,
>>>> playing some tricks because you know your data), so I would focus on
>>>> reducing the pain of "burning down your sink".
>>>> Are there any issues with the sink caused by the spikes? (What's the
>>>> downstream system?)
>>>> Does it make sense for you to artificially throttle down the stream
>>>> before the sink, so that the records per second are limited to a certain
>>>> rate? Since you are using event time, the window results will always be
>>>> correct & consistent. From a business perspective, this will of course
>>>> introduce additional latency (= results come in later).
>>>>
>>>>
>>>> On Fri, Aug 25, 2017 at 6:23 AM, Bowen Li <bowen...@offerupnow.com>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have a question about how Flink generates windows.
>>>>>
>>>>> We are using a 1-day sliding window with a 1-hour slide to count some
>>>>> features of items based on event time. We have about 20 million items.
>>>>> We observed that Flink only emits results at a fixed time each hour
>>>>> (e.g. 1am, 2am, 3am, or 1:15am, 2:15am, 3:15am with a 15min offset).
>>>>> That means 20 million windows/records are generated at the same time
>>>>> every hour, which burns down our sink, and nothing is generated during
>>>>> the rest of the hour. The pattern is like this:
>>>>>
>>>>> # generated windows
>>>>> |
>>>>> |    /\                  /\
>>>>> |   /  \                /  \
>>>>> |__/____\______________/____\__
>>>>>                                  time
>>>>>
>>>>> Is there any way to even out the number of generated windows/records
>>>>> within an hour? Can we get an evenly distributed load like this instead?
>>>>>
>>>>> # generated windows
>>>>> |
>>>>> |
>>>>> | ------------------------
>>>>> |_______________
>>>>>                                  time
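>>>>>
>>>>> For reference, our windowing is set up roughly like this (the stream
>>>>> and function names are placeholders):
>>>>>
>>>>> import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
>>>>> import org.apache.flink.streaming.api.windowing.time.Time;
>>>>>
>>>>> // items is a DataStream<ItemEvent> with event-time timestamps assigned
>>>>> items
>>>>>     .keyBy("itemId")
>>>>>     // 1-day windows sliding by 1 hour; an optional third Time argument
>>>>>     // shifts the alignment (that's where the 15min offset comes from)
>>>>>     .window(SlidingEventTimeWindows.of(Time.days(1), Time.hours(1)))
>>>>>     .apply(new CountItemFeatures());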
>>>>>
>>>>> Thanks,
>>>>> Bowen
>>>>>
>>>>>
>>>>
>>>
>>
>
>
