Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread Stefan Richter
Hi, yes, timers cannot easily fire in parallel to event processing for correctness reasons because they both manipulate the state and there should be a distinct order of operations. If it is literally stuck, then it is obviously a problem. From the stack trace it looks pretty clear that the
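
To make the ordering argument concrete, here is a minimal sketch (not taken from the thread; the class and field names are made up) of a ProcessFunction in which both processElement() and onTimer() read and update the same keyed state. Because both callbacks touch that state, Flink has to run them one after the other rather than in parallel.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    public class CountWithTimeout extends ProcessFunction<String, String> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            Long current = count.value();
            count.update(current == null ? 1L : current + 1);              // mutates keyed state
            // schedule a callback one minute after the element's timestamp
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            out.collect("count so far: " + count.value());                 // reads the same keyed state
        }
    }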

Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread shishal singh
Thanks Stefan, you are correct. I learned the hard way that when timers fire, processing of new events stops until all timer callbacks complete. This is the point at which I decided to isolate the problem by scheduling only 5-6K timers in total, so that even if it is taking time in the timers it

Re: Flink job hangs using rocksDb as backend

2018-07-23 Thread Stefan Richter
Hi, let me first clarify what you mean by "stuck": just because your job stops consuming events for some time does not necessarily mean that it is "stuck". That is very hard to evaluate from the information we have so far, because from the stack trace you cannot conclude that the thread is

Re: Flink job hangs using rocksDb as backend

2018-07-20 Thread shishal singh
Hi Richter, for testing I have now reduced the number of timers to a few thousand (5-6K), but my job still gets stuck randomly, and it is not reproducible each time. When I restart the job it starts working again for a few hours/days and then gets stuck again. I took thread

Re: Flink job hangs using rocksDb as backend

2018-07-12 Thread Stefan Richter
Hi, did you check the metrics for the garbage collector? Being stuck with high CPU consumption and lots of timers sounds like there could be a problem, because timers are currently on-heap objects, but we are working on RocksDB-based timers right now. Best, Stefan > Am 12.07.2018 um 14:54
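
Flink's system metrics expose this per collector (the Status.JVM.GarbageCollector count/time metrics). For a quick look at the same numbers, the standard JVM management API reports them as well; this is only an illustrative standalone snippet, not something from the thread:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcQuickCheck {
        public static void main(String[] args) {
            // Cumulative collection counts and times per collector; values that
            // grow quickly under load point at GC pressure, e.g. from millions
            // of on-heap timer objects.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d, time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }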

Re: Flink job hangs using rocksDb as backend

2018-07-12 Thread shishal singh
Thanks Stefan/Stephan/Nico, indeed there are 2 problems. For the 2nd problem, I am almost certain that the explanation given by Stephan is true in my case, as the number of timers is in the millions (each for a different key, so I guess coalescing is not an option for me). If I simplify my

Re: Flink job hangs using rocksDb as backend

2018-07-11 Thread Nico Kruber
If this is about too many timers and your application allows it, you may also try to reduce the timer resolution and thus frequency by coalescing them [1]. Nico [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/operators/process_function.html#timer-coalescing On
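
The coalescing idea from the linked documentation can be sketched as follows (class name and timeout value are illustrative, not from the thread): rounding the target time down to full seconds means at most one timer per key and second is registered, because re-registering an already-registered timestamp is a no-op.

    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    public class CoalescedTimers extends ProcessFunction<String, String> {

        private static final long TIMEOUT = 60_000L;   // illustrative timeout

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            // round down to the next full second so repeated registrations collapse
            long coalescedTime = ((ctx.timestamp() + TIMEOUT) / 1000) * 1000;
            ctx.timerService().registerEventTimeTimer(coalescedTime);
            out.collect(value);
        }
    }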

Re: Flink job hangs using rocksDb as backend

2018-07-11 Thread Stephan Ewen
Hi shishal! I think there is an issue with cancellation when many timers fire at the same time. These timers have to finish before shutdown happens, and this seems to take a while in your case. Did the TM process actually kill itself in the end (and get restarted)? On Wed, Jul 11, 2018 at 9:29

Flink job hangs using rocksDb as backend

2018-07-11 Thread shishal
Hi, I am using Flink 1.4.2 with RocksDB as the state backend. I am using a process function with timers on event time. For checkpointing I am using HDFS. I am doing load testing, so I am reading Kafka from the beginning (approx. 7 days of data with 50M events). My job gets stuck after approx. 20 min with no error.
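
For context, a rough reconstruction of the kind of job being described (this is a sketch, not the poster's actual code; the topic name, HDFS path, broker address, and the use of the Kafka 0.11 connector are assumptions):

    import java.util.Properties;

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class TimerLoadTestSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
            env.enableCheckpointing(60_000);                                           // illustrative interval
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")); // made-up path

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");                      // made-up address
            props.setProperty("group.id", "load-test");

            FlinkKafkaConsumer011<String> consumer =
                    new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props);
            consumer.setStartFromEarliest();                                           // replay ~7 days / 50M events

            env.addSource(consumer)
                    // a real job would assign event-time timestamps/watermarks here
                    .keyBy(new KeySelector<String, String>() {
                        @Override
                        public String getKey(String value) {
                            return value;                                              // placeholder key selector
                        }
                    })
                    .process(new CountWithTimeout())   // e.g. the ProcessFunction sketch earlier in this thread
                    .print();

            env.execute("timer load test sketch");
        }
    }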