The Flink jobs are deployed in Yarn cluster. I am seeing the following log
for some of my jobs in Job Manager. I'm using Flink 1.4. The job has,
taskmanager.exit-on-fatal-akka-error=true. 
But I don't see the task manager being restarted. 

I made the following observations - 
1. One job does a join on two kafka topic. One of the stream didn't have any
data in last 24 hours. 
2. Two jobs that have the same log in JobManager.out but is working fine and
the records are being generated. 

{"debug_level":"ERROR","debug_timestamp":"2018-10-24
05:23:25,092","debug_thread":"flink-akka.actor.default-dispatcher-20","debug_file":"MarkerIgnoringBase.java",
"debug_line":"161","debug_message":"Association to
[akka.tcp://flink@ip-*-*-*-*.ap-southeast-1.compute.internal:58208] with UID
[930934199] irrecoverably failed. Quarantining address.", "job_name":
"eb99e094-74c9-4036-aa08-d379d62b7ff2" }
java.util.concurrent.TimeoutException: Remote system has been silent for too
long. (more than 48.0 hours)
        at
akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:375)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at 
akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


I read other people having the same kind of issue and that
taskmanager.exit-on-fatal-akka-error setting worked for them. I'm not sure
why I'm seeing this issue and why is that the stream is working fine without
restart and with the error. Will appreciate any help. Thanks!




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply via email to