Bhumika Bayani created FLINK-8624:
-------------------------------------

             Summary: flink-mesos: The flink rest-api sometimes becomes unresponsive
                 Key: FLINK-8624
                 URL: https://issues.apache.org/jira/browse/FLINK-8624
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.3.2
            Reporter: Bhumika Bayani


Sometimes the flink-mesos-scheduler fails or gets killed, and Marathon brings it up again on another node. In some of these cases we have observed that the REST API of the newly created Flink instance becomes unresponsive.

Even when we call the API manually with curl, for example

http://<host>:<port>/overview or http://<host>:<port>/config

we do not receive any response.
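The manual curl checks can also be scripted, so that an unresponsive instance is noticed right after Marathon restarts the scheduler. The following is a minimal sketch, assuming Java 8+ and only the JDK's {{HttpURLConnection}}; the class name {{RestProbe}} and the 5000 ms timeout are hypothetical choices for illustration, not part of Flink. It treats anything other than an HTTP 200 within the timeout (connection refused, a server that accepts the connection but never answers, or an error status) as "no response".

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Hypothetical health probe (not a Flink class): sends GET requests to the
 * REST endpoints named in this issue and reports whether each one answers.
 */
public class RestProbe {

    /** Returns true only if the URL answers with HTTP 200 within timeoutMs. */
    public static boolean isResponsive(String url, int timeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMs); // fail if nothing is listening
            conn.setReadTimeout(timeoutMs);    // fail if the server accepts but never answers
            int status = conn.getResponseCode();
            if (status != 200) {
                return false;
            }
            try (InputStream body = conn.getInputStream()) {
                // Drain the response body; its content is not inspected.
                while (body.read() != -1) {
                    // discard
                }
            }
            return true;
        } catch (IOException e) {
            // Connection refused, connect/read timeout, or malformed URL.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: java RestProbe <host> <port>");
            return;
        }
        String base = "http://" + args[0] + ":" + args[1];
        for (String path : new String[] {"/overview", "/config"}) {
            boolean ok = isResponsive(base + path, 5000);
            System.out.println(path + " -> " + (ok ? "OK" : "NO RESPONSE"));
        }
    }
}
{code}

Run it as {{java RestProbe <host> <port>}}; "NO RESPONSE" for both endpoints would correspond to the hung state described here. The read timeout is set separately from the connect timeout because a hung server may still accept TCP connections while never sending a response.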

We submit and execute all our Flink jobs through the REST API only, so when the REST API becomes unresponsive we cannot run any Flink jobs and no stream processing happens.

We tried enabling Flink debug logs, but we did not observe anything specific that indicates why the REST API is failing or unresponsive.

We see the exceptions below in the logs, but they are not specific to the case where the Flink API is hung; we see them in a healthy flink-scheduler too:

 
{code:java}
Timestamp=2018-02-08 05:43:49,175 LogLevel=INFO
        ThreadId=[Checkpoint Timer] Class=o.a.f.r.c.CheckpointCoordinator Msg=Triggering checkpoint 10181 @ 1518068629174
Timestamp=2018-02-08 05:43:49,183 LogLevel=DEBUG
        ThreadId=[nioEventLoopGroup-5-3] Class=o.a.f.r.w.WebRuntimeMonitor Msg=Unhandled exception: {}
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager#753807801]] after [10000 ms]
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
{code}
 

While the REST API is unresponsive, we have observed that the Flink web UI also does not load or show any information.

Restarting the flink-scheduler sometimes resolves the issue.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
