Hi,大佬们
Flink job经常不定期重启,看了异常日志基本都是下面这种,可以帮忙解释下什么原因吗? 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down. 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down. 2020-07-01 20:20:43.875 [flink-metrics-16] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut down. 2020-07-01 20:20:43.875 [flink-metrics-16] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut down. 2020-07-01 20:20:43.891 [flink-metrics-16] INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service. 2020-07-01 20:20:43.895 [flink-akka.actor.default-dispatcher-15] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 2. java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply. at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871) at akka.dispatch.OnComplete.internal(Future.scala:263) at akka.dispatch.OnComplete.internal(Future.scala:261) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644) at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599) at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) at java.lang.Thread.run(Thread.java:745) Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply. at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) ... 9 common frames omitted |
从报错信息看是 Akka 的 RPC 调用超时,因为是 LocalFencedMessage 所以基本上可以排除网络问题。
建议看一下 JM 进程的 GC 压力以及线程数量,是否存在压力过大 RPC 来不及响应的情况。 Thank you~ Xintong Song On Fri, Jul 3, 2020 at 10:48 AM noon cjihg <[hidden email]> wrote: > Hi,大佬们 > > Flink job经常不定期重启,看了异常日志基本都是下面这种,可以帮忙解释下什么原因吗? > > 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO > akka.remote.RemoteActorRefProvider$RemotingTerminator > flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down. > 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO > akka.remote.RemoteActorRefProvider$RemotingTerminator > flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down. > 2020-07-01 20:20:43.875 [flink-metrics-16] INFO > akka.remote.RemoteActorRefProvider$RemotingTerminator > flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut > down. > 2020-07-01 20:20:43.875 [flink-metrics-16] INFO > akka.remote.RemoteActorRefProvider$RemotingTerminator > flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut > down. > 2020-07-01 20:20:43.891 [flink-metrics-16] INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC > service. > 2020-07-01 20:20:43.895 [flink-akka.actor.default-dispatcher-15] INFO > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating > cluster entrypoint process YarnJobClusterEntrypoint with exit code 2. > java.util.concurrent.CompletionException: > akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000 > ms]. Message of type > [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical > reason for `AskTimeoutException` is that the recipient actor didn't > send a reply. > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871) > at akka.dispatch.OnComplete.internal(Future.scala:263) > at akka.dispatch.OnComplete.internal(Future.scala:261) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644) > at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) > at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599) > at > scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597) > at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) > at > akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) > at > akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) > at > akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) > at java.lang.Thread.run(Thread.java:745) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000 > ms]. Message of type > [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical > reason for `AskTimeoutException` is that the recipient actor didn't > send a reply. > at > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) > at > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) > at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) > ... 9 common frames omitted > |
我们集群一般出现这种异常大都是因为 Full GC 次数比较多,然后最后伴随着就是 TaskManager 挂掉的异常
Xintong Song <[hidden email]> 于2020年7月3日周五 上午11:06写道: > 从报错信息看是 Akka 的 RPC 调用超时,因为是 LocalFencedMessage 所以基本上可以排除网络问题。 > 建议看一下 JM 进程的 GC 压力以及线程数量,是否存在压力过大 RPC 来不及响应的情况。 > > Thank you~ > > Xintong Song > > > > On Fri, Jul 3, 2020 at 10:48 AM noon cjihg <[hidden email]> wrote: > > > Hi,大佬们 > > > > Flink job经常不定期重启,看了异常日志基本都是下面这种,可以帮忙解释下什么原因吗? > > > > 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO > > akka.remote.RemoteActorRefProvider$RemotingTerminator > > flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down. > > 2020-07-01 20:20:43.875 [flink-akka.actor.default-dispatcher-27] INFO > > akka.remote.RemoteActorRefProvider$RemotingTerminator > > flink-akka.remote.default-remote-dispatcher-22 - Remoting shut down. > > 2020-07-01 20:20:43.875 [flink-metrics-16] INFO > > akka.remote.RemoteActorRefProvider$RemotingTerminator > > flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut > > down. > > 2020-07-01 20:20:43.875 [flink-metrics-16] INFO > > akka.remote.RemoteActorRefProvider$RemotingTerminator > > flink-metrics-akka.remote.default-remote-dispatcher-14 - Remoting shut > > down. > > 2020-07-01 20:20:43.891 [flink-metrics-16] INFO > > org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC > > service. > > 2020-07-01 20:20:43.895 [flink-akka.actor.default-dispatcher-15] INFO > > org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating > > cluster entrypoint process YarnJobClusterEntrypoint with exit code 2. > > java.util.concurrent.CompletionException: > > akka.pattern.AskTimeoutException: Ask timed out on > > [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000 > > ms]. Message of type > > [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical > > reason for `AskTimeoutException` is that the recipient actor didn't > > send a reply. > > at > > > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > > at > > > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > > at > > > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) > > at > > > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) > > at > > > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > > at > > > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > > at > > > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871) > > at akka.dispatch.OnComplete.internal(Future.scala:263) > > at akka.dispatch.OnComplete.internal(Future.scala:261) > > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) > > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) > > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > > at > > > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) > > at > > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > > at > > > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > > at > > > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644) > > at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) > > at > > > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599) > > at > > > scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) > > at > > > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597) > > at > > > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) > > at > > > akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) > > at > > > akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) > > at > > > akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) > > at java.lang.Thread.run(Thread.java:745) > > Caused by: akka.pattern.AskTimeoutException: Ask timed out on > > [Actor[akka://flink/user/resourcemanager#-781959047]] after [10000 > > ms]. Message of type > > [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical > > reason for `AskTimeoutException` is that the recipient actor didn't > > send a reply. > > at > > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) > > at > > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) > > at > > > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) > > ... 9 common frames omitted > > > |
Free forum by Nabble | Edit this page |