Re: Task Manager detached under load

2018-06-30 Thread Till Rohrmann
Hi Vishal, you should not need to configure anything else. Cheers, Till On Sat, Jun 30, 2018 at 7:23 PM Vishal Santoshi wrote: > A clarification.. In 1.5 with custom heartbeats are there additional > configurations we should be concerned about ? > > On Fri, May 25, 2018 at 10:17 AM, Steven Wu

Re: Task Manager detached under load

2018-06-30 Thread Vishal Santoshi
A clarification: in 1.5, with custom heartbeats, are there additional configurations we should be concerned about? On Fri, May 25, 2018 at 10:17 AM, Steven Wu wrote: > Till, thanks for the follow-up. looking forward to 1.5 :) > > On Fri, May 25, 2018 at 2:11 AM, Till Rohrmann > wrote: > >> Hi

Re: Task Manager detached under load

2018-05-25 Thread Steven Wu
Till, thanks for the follow-up. looking forward to 1.5 :) On Fri, May 25, 2018 at 2:11 AM, Till Rohrmann wrote: > Hi Steven, > > we don't have `jobmanager.exit-on-fatal-akka-error` because then the JM > would also be killed if a single TM gets quarantined. This is also not

Re: Task Manager detached under load

2018-05-25 Thread Till Rohrmann
Hi Steven, we don't have `jobmanager.exit-on-fatal-akka-error` because then the JM would also be killed if a single TM gets quarantined. This is also not a desired behaviour. With Flink 1.5 the problem with quarantining should be gone since we no longer rely on Akka's death watch and instead

Re: Task Manager detached under load

2018-05-13 Thread Steven Wu
Till, thanks for the clarification. Yes, that situation is undesirable as well. In our case, restarting the jobmanager could also recover the job from the akka association lock-out. It was actually the issue (high GC pause) on the jobmanager side that caused the akka failure. Do we have something like

Re: Task Manager detached under load

2018-05-13 Thread Till Rohrmann
Hi Steven, the reason we did not turn on this feature by default is that in case of a true JM failure, all of the TMs will think that they got quarantined, which triggers their shutdown. Depending on how many container restarts you have left on Yarn, for example, this can lead to a

Re: Task Manager detached under load

2018-04-25 Thread Steven Wu
Till, we ran into the same issue. It started with a high GC pause that caused the jobmanager to lose its ZooKeeper connection and leadership, and caused the jobmanager to quarantine the taskmanager in akka. Once quarantined, the akka association between jobmanager and taskmanager is locked forever. Your suggestion of "

Re: Task Manager detached under load

2018-02-24 Thread ashish pok
@Jelmer, this is Till's last response on the issue. -- Ashish On Mon, Feb 5, 2018 at 5:56 AM, Till Rohrmann wrote: Hi, this sounds like a serious regression wrt Flink 1.3.2 and we should definitely find out what's causing this problem. Given from what I see in the

Re: Task Manager detached under load

2018-02-05 Thread Ashish Pokharel
Hi Till, Thanks for the detailed response. I will try to gather some of this information during the week and follow up. -- Ashish > On Feb 5, 2018, at 5:55 AM, Till Rohrmann wrote: > > Hi, > > this sounds like a serious regression wrt Flink 1.3.2 and we should > definitely

Re: Task Manager detached under load

2018-02-05 Thread Till Rohrmann
Hi, this sounds like a serious regression wrt Flink 1.3.2 and we should definitely find out what's causing this problem. From what I see in the logs, the following happens: For some time the JobManager seems to no longer receive heartbeats from the TaskManager. This could be, for example,

Re: Task Manager detached under load

2018-01-30 Thread Cliff Resnick
I've seen a similar issue while running successive Flink SQL batches on 1.4. In my case, the Job Manager would fail with the log output about unreachability (with an additional statement about something going "horribly wrong"). Under workload pressure, I reverted to 1.3.2 where everything works

Re: Task Manager detached under load

2018-01-24 Thread Ashish Pokharel
I haven’t gotten much further with this. It doesn’t look GC related; at least the GC counters were not that atrocious. However, my main concern is: once the load subsides, why aren’t the TM and JM connecting again? That doesn’t look normal. I could definitely tell the JM was listening on the port and

Re: Task Manager detached under load

2018-01-23 Thread Lasse Nedergaard
Hi. Did you find a reason for the detaching? I sometimes see the same on our system running Flink 1.4 on DC/OS. I have enabled taskmanager.debug.memory.startLogThread for debugging. Best regards, Lasse Nedergaard > On 20 Jan 2018, at 12:57, Kien Truong wrote

Re: Task Manager detached under load

2018-01-20 Thread Kien Truong
Hi, You should enable and check your garbage collection log. We've encountered a case where the Task Manager disassociated due to a long GC pause. Regards, Kien On 1/20/2018 1:27 AM, ashish pok wrote: Hi All, We have hit some load related issues and was wondering if any one has some
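
As a rough illustration of the GC-log check suggested above, the following Python sketch scans log lines for collections whose wall-clock time exceeds a threshold. It is only a sketch under assumptions: it assumes the JVM was started with `-XX:+PrintGCDetails` (Java 8 HotSpot), which appends a `[Times: user=... sys=..., real=... secs]` field to each GC event. The function name `long_pauses`, the 5-second threshold, and the two sample lines are hypothetical and were made up for illustration; they do not come from any log in this thread.

```python
import re

# Matches the wall-clock field that HotSpot prints with -XX:+PrintGCDetails,
# e.g. "[Times: user=0.05 sys=0.00, real=0.02 secs]".
REAL_TIME = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(lines, threshold_secs=5.0):
    """Return (line_number, seconds) for each GC event longer than the threshold.

    threshold_secs is an arbitrary illustrative default, not a Flink or Akka setting.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = REAL_TIME.search(line)
        if match is None:
            continue  # not a GC event line with a timing field
        seconds = float(match.group(1))
        if seconds > threshold_secs:
            hits.append((number, seconds))
    return hits


# Hypothetical, abbreviated sample lines (not taken from a real cluster).
sample = [
    "10.5: [GC (Allocation Failure) ... [Times: user=0.05 sys=0.00, real=0.02 secs]",
    "40.8: [Full GC (Ergonomics) ... [Times: user=61.20 sys=0.40, real=12.75 secs]",
]

print(long_pauses(sample))
```

In practice one would pass the lines of the actual GC log file (for example `open(path)`) instead of `sample`, and pick a threshold close to the heartbeat or failure-detection timeout configured for the cluster.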