This is the exception before the job went into cancelled state. But when I
looked into the task manager node, the flink process is still running.
java.lang.Exception: TaskManager was lost/killed:
383f6af3299793ba73eeb7bdbab0ddc7 @
ip-xx.xx.xxx.xx.us-west-2.compute.internal (dataPort=37652)
at
org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
at
org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
at
org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
at
org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
at
org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
at org.apache.flink.runtime.jobmanager.JobManager.org
$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1202)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1105)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at
akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
On Fri, Mar 10, 2017 at 5:40 AM, Robert Metzger <rmetz...@apache.org> wrote:
> Hi,
>
> this error is only logged at WARN level. As Kaibo already said, its not a
> critical issue.
>
> Can you send some more messages from your log. Usually the Jobmanager logs
> why a taskmanager has failed. And the last few log messages of the failed
> TM itself are also often helpful.
>
>
>
> On Fri, Mar 10, 2017 at 10:46 AM, Kaibo Zhou <zkb...@gmail.com> wrote:
>
>> I think this is not the root cause of job failure, this task is caused by
>> other tasks failing. You can check the log of the first failed task.
>>
>> 2017-03-10 12:25 GMT+08:00 Govindarajan Srinivasaraghavan <
>> govindragh...@gmail.com>:
>>
>>> Hi All,
>>>
>>> I see the below error after running my streaming job for a while and
>>> when the load increases. After a while the task manager becomes completely
>>> dead and the job keeps on restarting.
>>>
>>> Also when I checked if there is an back pressure in the UI, it kept on
>>> saying sampling in progress and no results were displayed. Is there an API
>>> which can provide the back pressure details?
>>>
>>> 2017-03-10 01:40:58,793 WARN org.apache.flink.streaming.ap
>>> i.operators.AbstractStreamOperator - Error while emitting latency
>>> marker.
>>> org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException:
>>> Could not forward element to next operator
>>> at org.apache.flink.streaming.runtime.tasks.OperatorChain$Chain
>>> ingOutput.emitLatencyMarker(OperatorChain.java:426)
>>> at org.apache.flink.streaming.api.operators.AbstractStreamOpera
>>> tor$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>> at org.apache.flink.streaming.api.operators.StreamSource$Latenc
>>> yMarksEmitter$1.onProcessingTime(StreamSource.java:152)
>>> at org.apache.flink.streaming.runtime.tasks.SystemProcessingTim
>>> eService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:256)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executor
>>> s.java:511)
>>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:
>>> 308)
>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>> tureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>> tureTask.run(ScheduledThreadPoolExecutor.java:294)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>> Executor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>> lExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.RuntimeException
>>> at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emi
>>> tLatencyMarker(RecordWriterOutput.java:117)
>>> at org.apache.flink.streaming.api.operators.AbstractStreamOpera
>>> tor$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>> at org.apache.flink.streaming.api.operators.AbstractStreamOpera
>>> tor.reportOrForwardLatencyMarker(AbstractStreamOperator.java:708)
>>> at org.apache.flink.streaming.api.operators.AbstractStreamOpera
>>> tor.processLatencyMarker(AbstractStreamOperator.java:690)
>>> at org.apache.flink.streaming.runtime.tasks.OperatorChain$Chain
>>> ingOutput.emitLatencyMarker(OperatorChain.java:423)
>>> ... 10 more
>>> Caused by: java.lang.InterruptedException
>>> at java.lang.Object.wait(Native Method)
>>> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.r
>>> equestBuffer(LocalBufferPool.java:168)
>>> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.r
>>> equestBufferBlocking(LocalBufferPool.java:138)
>>> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.
>>> sendToTarget(RecordWriter.java:132)
>>> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.
>>> randomEmit(RecordWriter.java:107)
>>> at org.apache.flink.streaming.runtime.io.StreamRecordWriter.ran
>>> domEmit(StreamRecordWriter.java:104)
>>> at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emi
>>> tLatencyMarker(RecordWriterOutput.java:114)
>>> ... 14 more
>>>
>>>
>>>
>>
>