Hi,
yes, timers cannot easily fire in parallel to event processing, for
correctness reasons: both manipulate the state, so there must be a
well-defined order of operations. If the job is literally stuck, then that
is obviously a problem.
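To make that concrete, here is a minimal sketch (class name, state, and the
one-minute timeout are illustrative, not taken from your job) of why the two
callbacks must run one after the other: both read and write the same keyed
state, so interleaving them would corrupt it.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only; applied on a keyed stream. processElement and onTimer
// share the same keyed state, so the runtime has to serialize them.
public class CountWithTimeout extends ProcessFunction<String, String> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out)
            throws Exception {
        Long current = count.value();                      // reads shared state
        count.update(current == null ? 1L : current + 1);  // writes shared state
        // illustrative: fire one minute after this element's event time
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        out.collect("count=" + count.value());             // reads the same state
        count.clear();                                     // and mutates it, too
    }
}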
From the stack trace it looks pretty clear that the
Thanks Stefan,
You are correct, I learned the hard way that when timers fire, the job stops
processing new events until all timer callbacks complete. This is the
point when I decided to isolate the problem by scheduling only 5-6K
timers in total, so that even if it is taking time in timers it
Hi,
let me first clarify what you mean by "stuck": just because your job stops
consuming events for some time does not necessarily mean that it is "stuck".
That is very hard to evaluate from the information we have so far, because from
the stack trace you cannot conclude that the thread is
Hi Richter,
Actually, for the testing I have now reduced the number of timers to a few
thousand (5-6K), but my job still gets stuck randomly, and it is not
reproducible each time. When I restart the job, it again starts
working for a few hours/days and then gets stuck again.
I took thread
Hi,
Did you check the metrics for the garbage collector? Stuck with high CPU
consumption and lots of timers sounds like there could be a problem,
because timers are currently on-heap objects, but we are working on
RocksDB-based timers right now.
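If you do not have GC metrics wired up yet, a quick way to get visibility
(a sketch for flink-conf.yaml; the log path is only an example) is to enable
GC logging on the Flink JVMs:

# flink-conf.yaml: extra JVM options for all Flink processes
env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/flink-gc.log"

Long or frequent pauses in that log would point to pressure from the
on-heap timers.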
Best,
Stefan
> On 12.07.2018 at 14:54
Thanks Stefan/Stephan/Nico,
Indeed there are 2 problems. For the 2nd problem, I am almost certain that
the explanation given by Stephan is true in my case, as the number of
timers is in the millions. (Each is for a different key, so I guess
coalescing is not an option for me.)
If I simplify my
If this is about too many timers and your application allows it, you may
also try to reduce the timer resolution and thus frequency by coalescing
them [1].
Nico
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/operators/process_function.html#timer-coalescing
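For reference, the trick described in [1] is to round the target timestamp
down, e.g. to full seconds, so that each key registers at most one timer per
second (a sketch; the one-minute timeout is illustrative):

// inside processElement() of a ProcessFunction on a keyed stream
long timeout = 60_000L; // illustrative
long coalescedTime = ((ctx.timestamp() + timeout) / 1000) * 1000;
// timers are de-duplicated per key and timestamp, so re-registering
// the same rounded timestamp does not create additional timers
ctx.timerService().registerEventTimeTimer(coalescedTime);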
Hi shishal!
I think there is an issue with cancellation when many timers fire at the
same time. These timers have to finish before shutdown happens, and this
seems to take a while in your case.
Did the TM process actually kill itself in the end (and got restarted)?
On Wed, Jul 11, 2018 at 9:29
Hi,
I am using Flink 1.4.2 with RocksDB as the state backend. I am using a
process function with timers on event time. For checkpointing I am using HDFS.
I am doing load testing, so I am reading Kafka from the beginning (approx.
7 days of data with 50M events).
My job gets stuck after approx. 20 minutes with no error.
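Roughly, the job is wired up like this (a simplified sketch; the topic, the
broker address, the checkpoint path, and MyTimerProcessFunction are
placeholders, not my real code):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class TimerLoadTestJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // RocksDB state backend; checkpoints go to HDFS (placeholder path)
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder

        FlinkKafkaConsumer011<String> source =
                new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props);
        source.setStartFromEarliest(); // replay ~7 days / 50M events

        env.addSource(source)
           .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<String>() {
               @Override
               public long extractAscendingTimestamp(String element) {
                   // placeholder: parse the event time from the record instead
                   return System.currentTimeMillis();
               }
           })
           .keyBy(value -> value)                   // placeholder key extraction
           .process(new MyTimerProcessFunction())   // registers event-time timers
           .print();

        env.execute("timer-load-test");
    }
}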