Flink job不定期就会重启,版本是1.9

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink job不定期就会重启,版本是1.9

noon cjihg
Hi,大佬们

Flink job经常不定期重启,看了异常日志基本都是下面这种,可以帮忙解释下什么原因吗?

2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator
flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down.
2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator
flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down.
2020-07-01 20:20:43.875 [flink-metrics-16] INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator
flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut
down.
2020-07-01 20:20:43.875 [flink-metrics-16] INFO
akka.remote.RemoteActorRefProvider$RemotingTerminator
flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut
down.
2020-07-01 20:20:43.891 [flink-metrics-16] INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopped Akka RPC
service.
2020-07-01 20:20:43.895 [flink-akka.actor.default-dispatcher-15] INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Terminating
cluster entrypoint process YarnJobClusterEntrypoint with exit code 2.
java.util.concurrent.CompletionException:
akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka://flink/user/resourcemanager#-781959047]] after [10000
ms]. Message of type
[org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
reason for `AskTimeoutException` is that the recipient actor didn't
send a reply.
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
        at akka.dispatch.OnComplete.internal(Future.scala:263)
        at akka.dispatch.OnComplete.internal(Future.scala:261)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
        at java.lang.Thread.run(Thread.java:745)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka://flink/user/resourcemanager#-781959047]] after [10000
ms]. Message of type
[org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
reason for `AskTimeoutException` is that the recipient actor didn't
send a reply.
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
        ... 9 common frames omitted
Reply | Threaded
Open this post in threaded view
|

Re: Flink job不定期就会重启,版本是1.9

Xintong Song
从报错信息看是 Akka 的 RPC 调用超时,因为是 LocalFencedMessage 所以基本上可以排除网络问题。
建议看一下 JM 进程的 GC 压力以及线程数量,是否存在压力过大 RPC 来不及响应的情况。

Thank you~

Xintong Song



On Fri, Jul 3, 2020 at 10:48 AM noon cjihg <[hidden email]> wrote:

> Hi,大佬们
>
> Flink job经常不定期重启,看了异常日志基本都是下面这种,可以帮忙解释下什么原因吗?
>
> 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator
> flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down.
> 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator
> flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down.
> 2020-07-01 20:20:43.875 [flink-metrics-16] INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator
> flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut
> down.
> 2020-07-01 20:20:43.875 [flink-metrics-16] INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator
> flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut
> down.
> 2020-07-01 20:20:43.891 [flink-metrics-16] INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopped Akka RPC
> service.
> 2020-07-01 20:20:43.895 [flink-akka.actor.default-dispatcher-15] INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Terminating
> cluster entrypoint process YarnJobClusterEntrypoint with exit code 2.
> java.util.concurrent.CompletionException:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000
> ms]. Message of type
> [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
> reason for `AskTimeoutException` is that the recipient actor didn't
> send a reply.
>         at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>         at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>         at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>         at
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
>         at akka.dispatch.OnComplete.internal(Future.scala:263)
>         at akka.dispatch.OnComplete.internal(Future.scala:261)
>         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>         at
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
>         at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>         at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>         at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
>         at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>         at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>         at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>         at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>         at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>         at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000
> ms]. Message of type
> [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
> reason for `AskTimeoutException` is that the recipient actor didn't
> send a reply.
>         at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>         at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>         at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>         ... 9 common frames omitted
>
Reply | Threaded
Open this post in threaded view
|

Re: Flink job不定期就会重启,版本是1.9

zhisheng
我们集群一般出现这种异常大都是因为 Full GC 次数比较多,然后最后伴随着就是 TaskManager 挂掉的异常

Xintong Song <[hidden email]> 于2020年7月3日周五 上午11:06写道:

> 从报错信息看是 Akka 的 RPC 调用超时,因为是 LocalFencedMessage 所以基本上可以排除网络问题。
> 建议看一下 JM 进程的 GC 压力以及线程数量,是否存在压力过大 RPC 来不及响应的情况。
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Jul 3, 2020 at 10:48 AM noon cjihg <[hidden email]> wrote:
>
> > Hi,大佬们
> >
> > Flink job经常不定期重启,看了异常日志基本都是下面这种,可以帮忙解释下什么原因吗?
> >
> > 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO
> > akka.remote.RemoteActorRefProvider$RemotingTerminator
> > flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down.
> > 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO
> > akka.remote.RemoteActorRefProvider$RemotingTerminator
> > flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down.
> > 2020-07-01 20:20:43.875 [flink-metrics-16] INFO
> > akka.remote.RemoteActorRefProvider$RemotingTerminator
> > flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut
> > down.
> > 2020-07-01 20:20:43.875 [flink-metrics-16] INFO
> > akka.remote.RemoteActorRefProvider$RemotingTerminator
> > flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut
> > down.
> > 2020-07-01 20:20:43.891 [flink-metrics-16] INFO
> > org.apache.flink.runtime.rpc.akka.AkkaRpcService  - Stopped Akka RPC
> > service.
> > 2020-07-01 20:20:43.895 [flink-akka.actor.default-dispatcher-15] INFO
> > org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Terminating
> > cluster entrypoint process YarnJobClusterEntrypoint with exit code 2.
> > java.util.concurrent.CompletionException:
> > akka.pattern.AskTimeoutException: Ask timed out on
> > [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000
> > ms]. Message of type
> > [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
> > reason for `AskTimeoutException` is that the recipient actor didn't
> > send a reply.
> >         at
> >
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> >         at
> >
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> >         at
> >
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> >         at
> >
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> >         at
> >
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> >         at
> >
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> >         at
> >
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
> >         at akka.dispatch.OnComplete.internal(Future.scala:263)
> >         at akka.dispatch.OnComplete.internal(Future.scala:261)
> >         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> >         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> >         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> >         at
> >
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> >         at
> > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> >         at
> >
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
> >         at
> >
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
> >         at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> >         at
> >
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
> >         at
> >
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> >         at
> >
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
> >         at
> >
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> >         at
> >
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> >         at
> >
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> >         at
> >
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> > [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000
> > ms]. Message of type
> > [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
> > reason for `AskTimeoutException` is that the recipient actor didn't
> > send a reply.
> >         at
> > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> >         at
> > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> >         at
> >
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> >         ... 9 common frames omitted
> >
>