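subject:"Re\: Flink 1.9.1 job has already failed, but the application on YARN is still running"

Re: Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-11-12 Post by amen...@163.com

Hi,

I am currently on flink-1.11.1 and did not pass the -d option, and I have run into the same problem. Any idea what is going on?

Best,
amenhub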



 
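From: Yang Wang
Sent: 2020-08-05 10:28
To: user-zh
Subject: Re: Flink 1.9.1 job has already failed, but the application on YARN is still running
Your Flink job was most likely started in attached mode, that is, without -d. Before 1.10, a job started this way is essentially a session,
and it only exits after the client has retrieved the result. If the client crashes or you stop it yourself, an empty session is left behind.
 
You can confirm that a session was started from the following log:
 
2020-08-04 10:45:36,868 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
Starting YarnSessionClusterEntrypoint (Version: 1.9.1, Rev:f23f82a,
Date:01.11.2019 @ 11:20:33 CST)
 
 
You can run flink run -d ... to get per-job mode, or upgrade to 1.10 or later, where both attach and detach are true per-job.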
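
The log check described in this reply can be scripted. This is a minimal Python sketch, assuming the JobManager log is already available as a string (for example, saved from the YARN application logs). The function name detect_cluster_mode and the regular expression are illustrative and are not part of Flink.

```python
import re

# Matches the entrypoint banner the JobManager prints at startup, e.g.
# "Starting YarnSessionClusterEntrypoint (Version: 1.9.1, ...)".
# \s+ tolerates the line wrapping introduced by mail clients.
ENTRYPOINT_RE = re.compile(r"Starting\s+(Yarn(Session|Job)ClusterEntrypoint)")


def detect_cluster_mode(log_text):
    """Return 'session', 'per-job', or None if no entrypoint banner is found."""
    match = ENTRYPOINT_RE.search(log_text)
    if match is None:
        return None
    return "session" if match.group(2) == "Session" else "per-job"


# Sample taken from the log excerpt quoted in this thread.
sample = (
    "2020-08-04 10:45:36,868 INFO\n"
    "org.apache.flink.runtime.entrypoint.ClusterEntrypoint -\n"
    "Starting YarnSessionClusterEntrypoint (Version: 1.9.1, Rev:f23f82a,\n"
    "Date:01.11.2019 @ 11:20:33 CST)\n"
)
print(detect_cluster_mode(sample))
```

If the result is 'session' for a job that was meant to run per-job, that matches the attached-mode behavior described here for versions before 1.10.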
 
 
Best,
Yang
 
bradyMk wrote on Tue, Aug 4, 2020 at 8:04 PM:
 
> Hello,
> Is this a bug in this Flink version itself? Does that mean there is no fix and I can only kill it manually?
>
>
>
> -
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/

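Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by bradyMk

So that's it! I reran the job with -d, and sure enough YarnSessionClusterEntrypoint changed to
YarnJobClusterEntrypoint. Lesson learned. This problem had puzzled me for a long time. Thank you so much for the answer!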



-
Best Wishes
--
Sent from: http://apache-flink.147419.n8.nabble.com/

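Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by bradyMk

So that's it! Sure enough, with -d it changed from YarnSessionClusterEntrypoint to YarnJobClusterEntrypoint.
Thank you so much! This problem had puzzled me for a long time; thanks for clearing it up.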



-
Best Wishes
--
Sent from: http://apache-flink.147419.n8.nabble.com/

Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by bradyMk

Hello,
Is this a bug in this Flink version itself? Does that mean there is no fix and I can only kill it manually?



-
Best Wishes
--
Sent from: http://apache-flink.147419.n8.nabble.com/

Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by bradyMk

Hello,
Is this the complete log you meant? Could you please take a look?
jobmanager_log.txt
  



-
Best Wishes
--
Sent from: http://apache-flink.147419.n8.nabble.com/

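Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by JasonLee

hi
I remember having this problem back on version 1.6.0. There does not seem to be a corresponding JIRA, but I have not run into it on newer versions. It probably only happens occasionally.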



--
Sent from: http://apache-flink.147419.n8.nabble.com/

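Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by Yang Wang

@bradyMk, could you post the complete JM log?
That way we can see why Flink's YarnResourceManager did not run its deregister logic.

@JasonLee, which bug do you mean? Is there already a corresponding JIRA?

Best,
Yang

JasonLee <17610775...@163.com> wrote on Tue, Aug 4, 2020 at 4:33 PM: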

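> hi
> This is itself a bug, and it probably has not been fixed yet.
>
>
> JasonLee
> Email: 17610775...@163.com
>
> Signature is customized by Netease Mail Master
>
> On 2020-08-04 at 15:41, bradyMk wrote:
> Hello,
> I submitted this in per-job mode, and the problem only shows up occasionally. This time the error log looks like this:
>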
> 2020-08-04 10:30:14,475 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job flink2Ots (e11a22af324049217fdff28aca9f73a5) switched from state FAILING to FAILED.
> java.lang.Exception: Container released on a *lost* node
>     at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job flink2Ots (e11a22af324049217fdff28aca9f73a5) because the restart strategy prevented it.
> java.lang.Exception: Container released on a *lost* node
>     at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job e11a22af324049217fdff28aca9f73a5.
> 2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down
>
> However, when I hit this error before, the application on YARN was able to exit.
>
>
>
> -
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
>

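Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by bradyMk

Hello,
I submitted this in per-job mode, and the problem only shows up occasionally. This time the error log looks like this: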

2020-08-04 10:30:14,475 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job flink2Ots (e11a22af324049217fdff28aca9f73a5) switched from state FAILING to FAILED.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job flink2Ots (e11a22af324049217fdff28aca9f73a5) because the restart strategy prevented it.
java.lang.Exception: Container released on a *lost* node
    at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job e11a22af324049217fdff28aca9f73a5.
2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down

However, when I hit this error before, the application on YARN was able to exit.
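
For logs like the one in this message, a small script can pull out the job state transitions and check whether the restart strategy gave up. This is a rough Python sketch, assuming the log has been pasted as plain text and may contain mail-client line wrapping. The helper summarize_jm_log and its patterns are hypothetical and are not Flink APIs.

```python
import re

# "Job <name> (<32-hex job id>) switched from state A to B."
TRANSITION_RE = re.compile(
    r"Job\s+(\S+)\s+\(([0-9a-f]{32})\)\s+switched\s+from\s+state\s+(\w+)\s+to\s+(\w+)"
)
# Phrase logged when the restart strategy refuses another restart.
RESTART_BLOCKED = "restart strategy prevented it"


def summarize_jm_log(log_text):
    """Collect job state transitions and whether a restart was refused."""
    # Collapse all whitespace so wrapped log lines match as one line.
    normalized = " ".join(log_text.split())
    transitions = [
        (m.group(1), m.group(2), m.group(3), m.group(4))
        for m in TRANSITION_RE.finditer(normalized)
    ]
    return {
        "transitions": transitions,
        "restart_blocked": RESTART_BLOCKED in normalized,
    }


# Excerpt of the log posted in this thread, with its original wrapping.
sample_log = (
    "2020-08-04 10:30:14,475 INFO\n"
    "org.apache.flink.runtime.executiongraph.ExecutionGraph- Job\n"
    "flink2Ots (e11a22af324049217fdff28aca9f73a5) switched from state FAILING to\n"
    "FAILED.\n"
    "2020-08-04 10:30:14,476 INFO\n"
    "org.apache.flink.runtime.executiongraph.ExecutionGraph- Could not\n"
    "restart the job flink2Ots (e11a22af324049217fdff28aca9f73a5) because the\n"
    "restart strategy prevented it.\n"
)
summary = summarize_jm_log(sample_log)
print(summary["transitions"])
print(summary["restart_blocked"])
```

A FAILED transition together with a refused restart only shows that the job itself has ended. Whether the YARN application then exits depends on the cluster mode discussed elsewhere in this thread.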



-
Best Wishes
--
Sent from: http://apache-flink.147419.n8.nabble.com/

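Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by Yang Wang

I suspect you started a session cluster. For a per-job job, the application always exits after the job fails.

Could you post the jobmanager log? That makes it easier to troubleshoot.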


Best,
Yang

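bradyMk wrote on Tue, Aug 4, 2020 at 2:35 PM:

> Hello,
> The JM should still be running, because the Web UI is still accessible. But I would like to know why the JM keeps running when my job has clearly died. Is there some parameter I need to configure to fix this?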
>
>
>
> -
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/

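Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-04 Post by bradyMk

Hello,
The JM should still be running, because the Web UI is still accessible. But I would like to know why the JM keeps running when my job has clearly died. Is there some parameter I need to configure to fix this?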



-
Best Wishes
--
Sent from: http://apache-flink.147419.n8.nabble.com/

Re: Flink 1.9.1 job has already failed, but the application on YARN is still running

2020-08-03 Post by Congxian Qiu

Hi
   Maybe you could check whether the JM of the Flink job is still running?
Best,
Congxian


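bradyMk wrote on Tue, Aug 4, 2020 at 11:38 AM:

> A question for everyone:
>
> A Flink 1.9.1 job has already failed, but the application is still running on YARN, and the resources allocated on YARN have dropped to 1. The program uses a fixed-delay restart strategy. Does anyone know why the application keeps running on YARN after the job has died?
> <http://apache-flink.147419.n8.nabble.com/file/t802/Inked%E6%8D%95%E8%8E%B711_LI.jpg>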
>
> 
>
>
>
>
> -
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
>
