Re: Flink HA on AWS: Network related issue

2016-09-15 Thread Deepak Jha
Hi Till, There is a way to shut down the actor systems by setting taskmanager.maxRegistrationDuration to a reasonable duration (e.g. 900 seconds). The default value is Inf. With this setting I noticed that the TaskManager goes down, runit restarts the service, and it connects to the JobManager again. As I

Re: Flink HA on AWS: Network related issue

2016-09-11 Thread Deepak Jha
Hi Till, One more thing I noticed after looking into the following message in the TaskManager log: 2016-09-11 17:57:25,310 PDT [WARN] ip-10-6-0-15 [flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address is now

Re: Flink HA on AWS: Network related issue

2016-09-09 Thread Deepak Jha
Hi Till, I'm getting the following message in the JobManager log: 2016-09-09 07:46:55,093 PDT [WARN] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - *Detected unreachable: [akka.tcp://flink@10.8.4.57:6121 ]* 2016-09-09 07:46:55,094 PDT

Re: Flink HA on AWS: Network related issue

2016-09-09 Thread Till Rohrmann
Hi Deepak, could you check in the logs whether the JobManager has been quarantined and thus cannot be connected to anymore? The logs should at least contain a hint as to why the TaskManager lost the connection initially. Cheers, Till On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha

Flink HA on AWS: Network related issue

2016-09-08 Thread Deepak Jha
Hi, I've set up Flink HA on AWS (3 TaskManagers and 2 JobManagers, each on an EC2 m4.large instance, with checkpoints enabled on S3). My topology works fine, but after a few hours I see that the TaskManagers get detached from the JobManager. I tried to reach the JobManager using telnet at the same time and