Checkpoints are largely asynchronous, but the checkpointing of timers has
some synchronous component (which we are currently working on getting rid
of).
So when you have a lot of timers, streams stall for a short time while the
timers are checkpointed. If all goes as planned, Flink 1.6 will not have
that stall any more.

Concerning the delay on timers - I don't think that is an issue of heaps vs.
timer wheels, etc. (timer wheels are not magically better at everything that
has to do with timers).
It sounds more like execution is becoming contended. The reason for the
contention could very well be the checkpointing of timers (the stall
described above, which gets longer as more timers are registered).
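
To make the effect concrete, here is a minimal sketch in plain Java. It is
not Flink code: the class and method names are made up for illustration, and
the sleep stands in for the synchronous part of the timer checkpoint. The
only assumption taken from this thread is that timer callbacks and the rest
of the task's work are serialized on one monitor. A timer that becomes due
while that monitor is held cannot fire until it is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Flink's implementation.
public class TimerStallSketch {
    // Stand-in for the single lock that serializes element processing,
    // timer callbacks, and the synchronous part of a checkpoint.
    private final Object lock = new Object();

    /**
     * Holds the lock for holdMillis (the simulated synchronous snapshot)
     * and returns how long a timer that became due in the meantime had
     * to wait before its callback could run.
     */
    public long measureTimerDelayMillis(long holdMillis) {
        CountDownLatch timerDue = new CountDownLatch(1);
        long[] delayNanos = new long[1];

        Thread timerThread = new Thread(() -> {
            long due = System.nanoTime();   // the timer is due now
            timerDue.countDown();
            synchronized (lock) {           // the callback needs the same lock
                delayNanos[0] = System.nanoTime() - due;
            }
        });

        try {
            synchronized (lock) {
                timerThread.start();
                timerDue.await();           // timer is due, but blocked on the lock
                Thread.sleep(holdMillis);   // simulated synchronous snapshot work
            }
            timerThread.join();             // also makes delayNanos[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(delayNanos[0]);
    }

    public static void main(String[] args) {
        long delay = new TimerStallSketch().measureTimerDelayMillis(200);
        System.out.println("timer callback delayed by ~" + delay + " ms");
    }
}
```

The measured delay tracks the time the lock is held, not the cost of the
callback itself, which would match callbacks that do no IO still firing
seconds late.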


On Wed, Sep 20, 2017 at 2:53 PM, Narendra Joshi <narendr...@gmail.com>
wrote:

> I have a couple of questions related to this:
>
> 1. We store state per key (RocksDB backend). Currently, the state size
> is ~1.5 GB. Checkpointing time sometimes reaches ~10-20 seconds. Is it
> possible that checkpointing is affecting timer execution?
> 2. Does checkpointing cause Flink to stop consumption of data streams
> (say from Kafka)? We have observed that when the timers are delayed,
> there is a delay in picking up messages from Kafka.
> 3. Are there any metrics exposed by Flink that could help us
> understand better where the delay is coming from? Is there a metric
> for knowing about contention between `processElement` and `onTimer`?
> 4. Is there a plan for moving from `ScheduledThreadPoolExecutor` to
> using timing wheels for timeouts?
>
> If there is any other information that you need, please let me know.
>
> On Tue, Sep 19, 2017 at 10:37 PM, Narendra Joshi <narendr...@gmail.com>
> wrote:
> > The number of timers is about 400 per second. We have observed that
> > onTimer calls are delayed only when the number of scheduled timers
> > starts increasing from a minimum. It would be great if you can share
> > pointers to code I can look at to understand it better. :)
> >
> > Narendra Joshi
> >
> > On 14 Sep 2017 16:04, "Aljoscha Krettek" <aljos...@apache.org> wrote:
> >>
> >> Hi,
> >>
> >> Yes, execution of these methods is protected by a synchronized block.
> >> This is not a fair lock so incoming data might starve timer callbacks.
> >> What is the number of timers we are talking about here?
> >>
> >> Best,
> >> Aljoscha
> >>
> >> > On 11. Sep 2017, at 19:38, Chesnay Schepler <c.schep...@web.de>
> >> > wrote:
> >> >
> >> > It is true that onTimer and processElement are never called at the
> >> > same time.
> >> >
> >> > I'm not entirely sure whether there is any prioritization/fairness
> >> > between these methods (if not, it could be that onTimer is starved),
> >> > looping in Aljoscha, who hopefully knows more about this.
> >> >
> >> > On 10.09.2017 09:31, Narendra Joshi wrote:
> >> >> Hi,
> >> >>
> >> >> We are using Flink as a timer scheduler and delay in timer execution
> >> >> is a huge problem for us. What we have experienced is that as the
> >> >> number of timers we register increases, the timers start getting
> >> >> delayed (by more than 5 seconds). Can anyone point us in the right
> >> >> direction to figure out what might be happening?
> >> >>
> >> >> I have been told that `onTimer` and `processElement` are called with
> >> >> a mutually exclusive lock. Could this locking be the reason this is
> >> >> happening? In both the functions there is no IO happening and it
> >> >> should not take 5 seconds.
> >> >>
> >> >> Is it possible that calls to `processElement` starve `onTimer` calls?
> >> >>
> >> >>
> >> >> --
> >> >> Narendra Joshi
> >> >>
> >> >
> >>
> >
>
