Re: Flink HA on AWS: Network related issue

2016-09-15 Thread Deepak Jha
Hi Till,
There is a way to shutdown actor systems by
setting  taskmanager.maxRegistrationDuration to a reasonable duration
(eg: 900 seconds). Default value sets it to Inf. In this case I noticed
that Taskmanager goes down and runit restarts the service and it gets
connected with Jobmanager.

 As I said earlier as well that retries to connect to Jobmanager does not
work even though telnet works at the same time to the same Jobmanager on
port 50050.  So retry does cache something which does not allow it to
reconnect. My flink cluster is on aws ( m4.large instances), not sure if
anyone else has observed this behavior.



On Thursday, September 15, 2016, Till Rohrmann  wrote:

> Hi Deepak,
>
> it seems that the JobManager's deathwatch declares the TaskManager to be
> unreachable which will automatically quarantine it. You're right that in
> such a case, the TaskManager should shut down and be restarted so that it
> can again reconnect to the JobManager. This is, however, not yet supported
> automatically.
>
> For the moment, I'd recommend you to make the deathwatch a little bit less
> aggressive via the following settings:
>
> - akka.watch.heartbeat.interval: Heartbeat interval for Akka’s DeathWatch
> mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked
> dead because of lost or delayed heartbeat messages, then you should
> increase this value. A thorough description of Akka’s DeathWatch can be
> found here (DEFAULT: akka.ask.timeout/10).
> - akka.watch.heartbeat.pause: Acceptable heartbeat pause for Akka’s
> DeathWatch mechanism. A low value does not allow a irregular heartbeat. A
> thorough description of Akka’s DeathWatch can be found here (DEFAULT:
> akka.ask.timeout).
> - akka.watch.threshold: Threshold for the DeathWatch failure detector. A
> low value is prone to false positives whereas a high value increases the
> time to detect a dead TaskManager. A thorough description of Akka’s
> DeathWatch can be found here (DEFAULT: 12).
>
> I hope this helps you to work around the problem for the moment until we've
> added the automatic shut down and restart.
>
> Cheers,
> Till
>
> On Mon, Sep 12, 2016 at 5:55 AM, Deepak Jha  > wrote:
>
> > Hi Till,
> > One more thing i noticed after looking into following message in
> > taskmanager log
> >
> > 2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
> > [flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate
> > with
> > unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address
> is
> > now gated for 5000 ms, all messages to this address will be delivered to
> > dead letters. Reason: *The remote system has quarantined this system. No
> > further associations to the remote system are possible until this system
> is
> > restarted*.
> >
> > So in this case ideally the local ActorSystem should go down so that
> > service supervisor/runit will restart the system and taskmanager will
> again
> > be able to connect to the remote system.. If it does not happen
> > automatically then we have to monitor logs in some way and then try to
> > ensure that it restarts. Ideally flink taskmanager Actor System should go
> > down. Please let me know if my understanding is wrong.
> >
> >
> >
> >
> >
> > On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha  > wrote:
> >
> > > Hi Till,
> > > I'm getting following message in Jobmanager log
> > >
> > > 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher -
> > *Detected
> > > unreachable: [akka.tcp://flink@10.8.4.57:6121
> > > ]*
> > > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.
> > JobManager
> > > - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager
> > > terminated.
> > > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.
> > InstanceManager
> > > - Unregistered task manager akka.tcp://flink@10.8.4.57 :
> > > 6121/user/taskmanager. Number of registered task managers 2. Number of
> > > available slots 4.
> > > 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> > > [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> > > irrecoverably failed. *UID is now quarantined and all messages to this
> > > UID will be delivered to dead letters. Remote actorsystem must be
> > restarted
> > > to recover from this situation.*
> > > 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> > > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > > Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> > > endpointManager/reliableEndpointWriter-akka.
> > 

Re: Flink HA on AWS: Network related issue

2016-09-11 Thread Deepak Jha
Hi Till,
One more thing i noticed after looking into following message in
taskmanager log

2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
[flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate with
unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: *The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted*.

So in this case ideally the local ActorSystem should go down so that
service supervisor/runit will restart the system and taskmanager will again
be able to connect to the remote system.. If it does not happen
automatically then we have to monitor logs in some way and then try to
ensure that it restarts. Ideally flink taskmanager Actor System should go
down. Please let me know if my understanding is wrong.





On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha  wrote:

> Hi Till,
> I'm getting following message in Jobmanager log
>
> 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - 
> *Detected
> unreachable: [akka.tcp://flink@10.8.4.57:6121
> ]*
> 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.JobManager
> - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager
> terminated.
> 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager
> - Unregistered task manager akka.tcp://flink@10.8.4.57:
> 6121/user/taskmanager. Number of registered task managers 2. Number of
> available slots 4.
> 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> irrecoverably failed. *UID is now quarantined and all messages to this
> UID will be delivered to dead letters. Remote actorsystem must be restarted
> to recover from this situation.*
> 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> Message [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%
> 3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%
> 2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
> Message [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://flink/deadLetters] to Actor[akka://flink/system/transports/
> akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%
> 2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
> [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
> Recovering checkpoints from ZooKeeper.
>
> Hope it helps. I'm using Flink 1.0.2
>
> On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann 
> wrote:
>
>> Hi Deepak,
>>
>> could you check the logs whether the JobManager has been quarantined and
>> thus, cannot be connected to anymore? The logs should at least contain a
>> hint why the TaskManager lost the connection initially.
>>
>> Cheers,
>> Till
>>
>> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha  wrote:
>>
>> > Hi,
>> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are
>> on
>> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
>> > fine, but after few hours I do see that Taskmanagers gets detached with
>> > Jobmanager. I tried to reach Jobmanager using telnet at the same time
>> and
>> > it worked but Taskmanager does not succeed in connecting again. It
>> attaches
>> > only after I restart it. I tried following settings but still the
>> problem
>> > persists.
>> >
>> > akka.ask.timeout: 20 s
>> > akka.lookup.timeout: 20 s
>> > akka.watch.heartbeat.interval: 20 s
>> >
>> > Please find attached snapshot on one of the Taskmanager. Is there any
>> > setting that I need to do ?
>> >
>> > --
>> > Thanks,
>> > Deepak Jha
>> >
>> >
>>
>
>
>
> --
> Thanks,
> Deepak Jha
>
>


-- 
Thanks,
Deepak Jha


Re: Flink HA on AWS: Network related issue

2016-09-09 Thread Deepak Jha
Hi Till,
I'm getting following message in Jobmanager log

2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - *Detected
unreachable: [akka.tcp://flink@10.8.4.57:6121
]*
2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985]
o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://
flink@10.8.4.57:6121/user/taskmanager terminated.
2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager
- Unregistered task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager.
Number of registered task managers 2. Number of available slots 4.
2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-982] Remoting - Association to
[akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is irrecoverably
failed. *UID is now quarantined and all messages to this UID will be
delivered to dead letters. Remote actorsystem must be restarted to recover
from this situation.*
2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
Message [akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://flink/deadLetters] to
Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0#393939009]
was not delivered. [54] dead letters encountered. This logging can be
turned off or adjusted with configuration settings 'akka.log-dead-letters'
and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
Message [akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://flink/deadLetters] to
Actor[akka://flink/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fflink%4010.8.4.57%3A51291-2#1151730456]
was not delivered. [55] dead letters encountered. This logging can be
turned off or adjusted with configuration settings 'akka.log-dead-letters'
and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
[ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper.

Hope it helps. I'm using Flink 1.0.2

On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann  wrote:

> Hi Deepak,
>
> could you check the logs whether the JobManager has been quarantined and
> thus, cannot be connected to anymore? The logs should at least contain a
> hint why the TaskManager lost the connection initially.
>
> Cheers,
> Till
>
> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha  wrote:
>
> > Hi,
> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are on
> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
> > fine, but after few hours I do see that Taskmanagers gets detached with
> > Jobmanager. I tried to reach Jobmanager using telnet at the same time and
> > it worked but Taskmanager does not succeed in connecting again. It
> attaches
> > only after I restart it. I tried following settings but still the problem
> > persists.
> >
> > akka.ask.timeout: 20 s
> > akka.lookup.timeout: 20 s
> > akka.watch.heartbeat.interval: 20 s
> >
> > Please find attached snapshot on one of the Taskmanager. Is there any
> > setting that I need to do ?
> >
> > --
> > Thanks,
> > Deepak Jha
> >
> >
>



-- 
Thanks,
Deepak Jha


Re: Flink HA on AWS: Network related issue

2016-09-09 Thread Till Rohrmann
Hi Deepak,

could you check the logs whether the JobManager has been quarantined and
thus, cannot be connected to anymore? The logs should at least contain a
hint why the TaskManager lost the connection initially.

Cheers,
Till

On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha  wrote:

> Hi,
> I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are on
> EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
> fine, but after few hours I do see that Taskmanagers gets detached with
> Jobmanager. I tried to reach Jobmanager using telnet at the same time and
> it worked but Taskmanager does not succeed in connecting again. It attaches
> only after I restart it. I tried following settings but still the problem
> persists.
>
> akka.ask.timeout: 20 s
> akka.lookup.timeout: 20 s
> akka.watch.heartbeat.interval: 20 s
>
> Please find attached snapshot on one of the Taskmanager. Is there any
> setting that I need to do ?
>
> --
> Thanks,
> Deepak Jha
>
>


Flink HA on AWS: Network related issue

2016-09-08 Thread Deepak Jha
Hi,
I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are on
EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
fine, but after few hours I do see that Taskmanagers gets detached with
Jobmanager. I tried to reach Jobmanager using telnet at the same time and
it worked but Taskmanager does not succeed in connecting again. It attaches
only after I restart it. I tried following settings but still the problem
persists.

akka.ask.timeout: 20 s
akka.lookup.timeout: 20 s
akka.watch.heartbeat.interval: 20 s

Please find attached snapshot on one of the Taskmanager. Is there any
setting that I need to do ?

-- 
Thanks,
Deepak Jha