CONTENTS DELETED
The author has deleted this message.
|
Hi,

Could you check whether the JM (JobManager) of the Flink job is still running?

Best,
Congxian

bradyMk <[hidden email]> wrote on Tue, Aug 4, 2020 at 11:38 AM:

> A question for everyone:
>
> My Flink 1.9.1 job has already failed, but the application on YARN is still RUNNING, and the resources allocated on YARN have changed to 1. The program uses the fixed-delay restart strategy. Does anyone know why the application keeps running on YARN after the job has died?
>
> <http://apache-flink.147419.n8.nabble.com/file/t802/Inked%E6%8D%95%E8%8E%B711_LI.jpg>
> <http://apache-flink.147419.n8.nabble.com/file/t802/%E6%8D%95%E8%8E%B7.png>
>
> -----
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
|
CONTENTS DELETED
The author has deleted this message.
|
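I suspect you started a session cluster. With a per-job submission, the application always exits once the job has failed.

Could you post the JobManager log? That would make it easier to troubleshoot.

Best,
Yang

bradyMk <[hidden email]> wrote on Tue, Aug 4, 2020 at 2:35 PM:

> Hello,
> The JM should still be running, because the Web UI is still accessible. What I want to know is why the JM keeps running when my job has clearly died. Is there a parameter I need to configure to fix this?
>
> -----
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
|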
CONTENTS DELETED
The author has deleted this message.
|
@bradyMk, could you post the complete JM log? Then we can see why Flink's YarnResourceManager did not run its deregister logic.

@JasonLee, which bug are you referring to? Is there already a JIRA for it?

Best,
Yang

JasonLee <[hidden email]> wrote on Tue, Aug 4, 2020 at 4:33 PM:

> hi
> This is a bug in itself, and it probably has not been fixed yet.
>
> JasonLee
> Email: [hidden email]
>
> On 2020-08-04 15:41, bradyMk wrote:
> Hello,
> I submitted the job in per-job mode, and this behavior only occurs occasionally. This time the error log looks like this:
>
> 2020-08-04 10:30:14,475 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job flink2Ots (e11a22af324049217fdff28aca9f73a5) switched from state FAILING to FAILED.
> java.lang.Exception: Container released on a *lost* node
>     at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job flink2Ots (e11a22af324049217fdff28aca9f73a5) because the restart strategy prevented it.
> java.lang.Exception: Container released on a *lost* node
>     at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job e11a22af324049217fdff28aca9f73a5.
> 2020-08-04 10:30:14,476 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down
>
> However, when I ran into this same error before, the application on YARN was able to exit.
>
> -----
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
|
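hi

As I recall, this problem already existed when I was using version 1.6.0. There does not seem to be a corresponding JIRA, but I have not run into it on newer versions. It probably only shows up occasionally.

--
Sent from: http://apache-flink.147419.n8.nabble.com/

Best Wishes
JasonLee
|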
In reply to this post by Yang Wang
CONTENTS DELETED
The author has deleted this message.
|
In reply to this post by JasonLee
CONTENTS DELETED
The author has deleted this message.
|
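Your Flink job was most likely started in attached mode, i.e. without -d. Before 1.10, a job started this way is essentially a session: it only exits after the client has retrieved the result. If the client dies, or you stop it yourself, an empty session is left behind.

You can confirm that a session was started from a log line like this:

2020-08-04 10:45:36,868 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.9.1, Rev:f23f82a, Date:01.11.2019 @ 11:20:33 CST)

Running flink run -d ... gives you per-job mode. Alternatively, upgrade to 1.10 or later, where both attach and detach are true per-job.

Best,
Yang

bradyMk <[hidden email]> wrote on Tue, Aug 4, 2020 at 8:04 PM:

> Hello,
> Is this a bug in this Flink version itself? Does that mean there is no fix, and the only option is to kill it manually?
>
> -----
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
|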
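The log check Yang describes above can be scripted. The following is only a minimal sketch, not part of Flink: the function name `detect_cluster_mode` and the mapping table are hypothetical, and treating `YarnJobClusterEntrypoint` as the per-job counterpart of the `YarnSessionClusterEntrypoint` line quoted in the thread is an assumption you should verify against your own JM log.

```python
import re

# Entrypoint class names as they appear in the JobManager startup log.
# YarnSessionClusterEntrypoint is taken from the log line quoted in the thread;
# YarnJobClusterEntrypoint is ASSUMED to be what a per-job cluster prints.
ENTRYPOINT_MODES = {
    "YarnSessionClusterEntrypoint": "session",
    "YarnJobClusterEntrypoint": "per-job",
}

# Matches "... ClusterEntrypoint - Starting <EntrypointClass> (Version: ...)".
STARTING_RE = re.compile(r"ClusterEntrypoint\s+-\s+Starting (\w+)")


def detect_cluster_mode(log_lines):
    """Return "session", "per-job", "unknown", or None if no startup line is found."""
    for line in log_lines:
        match = STARTING_RE.search(line)
        if match:
            return ENTRYPOINT_MODES.get(match.group(1), "unknown")
    return None


if __name__ == "__main__":
    sample = (
        "2020-08-04 10:45:36,868 INFO "
        "org.apache.flink.runtime.entrypoint.ClusterEntrypoint - "
        "Starting YarnSessionClusterEntrypoint (Version: 1.9.1, Rev:f23f82a, "
        "Date:01.11.2019 @ 11:20:33 CST)"
    )
    print(detect_cluster_mode([sample]))
```

If this reports a session for a job you believed was per-job, that matches the attached-mode behavior described for versions before 1.10, and resubmitting with flink run -d is the suggestion from the thread.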
CONTENTS DELETED
The author has deleted this message.
|
In reply to this post by Yang Wang
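hi,

I am on flink-1.11.1 and did not pass the -d parameter, and I ran into the same problem. Any idea what is going on?

best,
amenhub

From: Yang Wang
Sent: 2020-08-05 10:28
To: user-zh
Subject: Re: flink1.9.1 job has already failed, but the application on YARN is still running

Your Flink job was most likely started in attached mode, i.e. without -d. Before 1.10, a job started this way is essentially a session: it only exits after the client has retrieved the result. If the client dies, or you stop it yourself, an empty session is left behind.

You can confirm that a session was started from a log line like this:

2020-08-04 10:45:36,868 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.9.1, Rev:f23f82a, Date:01.11.2019 @ 11:20:33 CST)

Running flink run -d ... gives you per-job mode. Alternatively, upgrade to 1.10 or later, where both attach and detach are true per-job.

Best,
Yang

bradyMk <[hidden email]> wrote on Tue, Aug 4, 2020 at 8:04 PM:

> Hello,
> Is this a bug in this Flink version itself? Does that mean there is no fix, and the only option is to kill it manually?
>
> -----
> Best Wishes
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
|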