As the subject says, I have a job that frequently hits this exception and then restarts. Today, one hour after the job started, I checked the checkpoint page in the web UI: the "restored" count had already reached 8. The Exceptions page shows this error, so I suspect most of those restores were caused by it.
Besides that, the error 'Job leader for job id eb5d2893c4c6f4034995b9c8e180f01e lost leadership' also causes the job to restart.

Below is an error log from just now (environment: Flink 1.12, standalone cluster, 5 JMs + 5 TMs, with JMs and TMs co-located on the same machines):

2021-03-08 14:31:40
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Error at remote task manager '10.35.185.38/10.35.185.38:2016'.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.decodeMsg(CreditBasedPartitionRequestClientHandler.java:294)
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelRead(CreditBasedPartitionRequestClientHandler.java:183)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelRead(NettyMessageClientDecoderDelegate.java:115)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.partition.ProducerFailedException: org.apache.flink.util.FlinkException: JobManager responsible for eb5d2893c4c6f4034995b9c8e180f01e lost the leadership.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.writeAndFlushNextMessageIfPossible(PartitionRequestQueue.java:221)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.enqueueAvailableReader(PartitionRequestQueue.java:108)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.userEventTriggered(PartitionRequestQueue.java:170)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:365)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1428)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:913)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.lambda$notifyReaderNonEmpty$0(PartitionRequestQueue.java:87)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:387)
    ... 3 more
Caused by: org.apache.flink.util.FlinkException: JobManager responsible for eb5d2893c4c6f4034995b9c8e180f01e lost the leadership.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1422)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:174)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1856)
    at java.util.Optional.ifPresent(Optional.java:159)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1855)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Job leader for job id eb5d2893c4c6f4034995b9c8e180f01e lost leadership.
    ... 24 more

(1) The ZooKeeper timeout is set to 60s; even with network problems, it seems unlikely that 60s would not be enough.
(2) akka.ask.timeout: 60s
    taskmanager.network.request-backoff.max: 60000
    This akka parameter had also already been raised to 60s earlier.

Given the above, I'd appreciate any ideas from the community.
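For reference, here is roughly how those keys sit in my flink-conf.yaml (a sketch from memory; key names as in the Flink 1.12 configuration docs, values as described above):

    # ZooKeeper client session timeout used by the HA services (milliseconds)
    high-availability.zookeeper.client.session-timeout: 60000
    # JM/TM RPC ask timeout
    akka.ask.timeout: 60s
    # maximum backoff for inter-TM partition requests (milliseconds)
    taskmanager.network.request-backoff.max: 60000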
Hi,
You could check the GC situation; frequent full GC can also cause exactly these symptoms.

Best,
jjiey

> On Mar 8, 2021, at 14:37, yidan zhao <[hidden email]> wrote:
> [...]
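P.S. A quick way to watch full-GC activity on a TM, assuming a JDK whose bin directory provides jstat (replace <tm-pid> with the TaskManager JVM's pid):

    # prints heap occupancy plus FGC (full-GC count) and FGCT (total full-GC time, seconds) every 5s
    jstat -gcutil <tm-pid> 5s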
OK, I'll take a look.
Also, I discovered today that the GC collector differs across many of my clusters. So far I've seen three variants; the default is G1. On clusters where flink-conf.yaml sets env.java.opts: "-XX:-OmitStackTraceInFastThrow", two other collectors show up: one reports "Parallel GC with 83 threads", the other "Mark Sweep Compact GC". Does Flink adjust anything dynamically based on memory size?

I think I now see why G1 is no longer used there: setting env.java.opts apparently overrides the default JVM options instead of appending to them, and all I actually wanted was to add -XX:-OmitStackTraceInFastThrow.

On Mon, Mar 8, 2021 at 3:09 PM 杨杰 <[hidden email]> wrote:
> [...]
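P.S. In case anyone else trips over this, a sketch of what I intend to set instead, on the assumption that env.java.opts replaces the default JVM options rather than appending to them, so the collector has to be pinned explicitly next to the flag I actually wanted:

    # pin G1 explicitly, since our extra flag would otherwise displace the default collector choice
    env.java.opts: "-XX:+UseG1GC -XX:-OmitStackTraceInFastThrow"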
Also, what settings would you all recommend? I'll probably just default to G1, but I'm not sure whether G1 itself also needs fine-tuning.
The memory I currently configure is fairly large (I have TaskManagers with 50 GB and with 100 GB). With heaps that big, does anything need special settings? Or would it be better to split them smaller, e.g. a 10 GB heap per TaskManager, and raise the number of TaskManagers instead?

On Tue, Mar 9, 2021 at 2:56 PM yidan zhao <[hidden email]> wrote:
> [...]
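P.S. To make the "split it smaller" variant concrete, a sketch of the keys I would touch (names per the Flink 1.12 memory model docs; the sizes are only illustrative, and the heap is just one slice of the process size):

    taskmanager.memory.process.size: 12g   # total per-TM footprint; heap, managed, network and JVM overhead all come out of this
    taskmanager.numberOfTaskSlots: 4       # hypothetical slot count; raise the TM count as each TM shrinks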
Have a look at the load at the time of the failure: was anything unusually high, and if so, why? Also monitor the network and the disks.
On 2021-03-09 14:57:43, "yidan zhao" <[hidden email]> wrote:
> [...]
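P.S. For example, with plain Linux tooling (assuming the sysstat package is installed; run on the TM hosts around the failure time):

    iostat -x 5     # per-device disk utilization and await
    sar -n DEV 5    # per-interface network throughput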
I took a look. CPU and the like do show spikes, but that's more or less expected, since my job works in 5-minute waves: there's a spike roughly every 5 minutes.
I've also now checked GC via the Flink web UI. Full GC is indeed a problem on some clusters: it averages roughly 10-20s. That's only an average, so I can't tell whether some individual GCs run even longer. Either way, 10-20s is clearly long, and I'll look into improving it.

(1) Still, I'm not sure this is directly related to the problem here: is a 20s pause enough to trigger it?
(2) Also, could people recommend a memory setup? For example, how many TMs do you run, with how much memory per TM, and at roughly what data volume?
My cluster has 5 TMs at 100 GB each; the job ingests roughly 100k QPS at the source, but a large share gets filtered out, so downstream traffic is much lighter. Checkpoints come to about 3-4 GB.

On Tue, Mar 9, 2021 at 4:27 PM Michael Ran <[hidden email]> wrote:
> [...]
One more addition, on the GC collector: is it fine to just use G1 everywhere without further thought? I had been on G1 all along and only switched away by accident when I recently edited the opts; changing the collector was never the intent.
On Tue, Mar 9, 2021 at 7:26 PM, yidan zhao <[hidden email]> wrote:
> [quoted earlier messages and the full stack trace trimmed]
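For reference: if the goal was only to add -XX:-OmitStackTraceInFastThrow while keeping G1, a minimal flink-conf.yaml sketch is below. This assumes (worth verifying against your version's bin/taskmanager.sh) that the startup script appends -XX:+UseG1GC only when no custom env.java.opts is set, so user-supplied opts replace the script's GC choice rather than add to it:

    # Keep fast-throw stack traces and re-state G1 explicitly, since custom
    # opts appear to suppress the script-provided -XX:+UseG1GC.
    env.java.opts: "-XX:-OmitStackTraceInFastThrow -XX:+UseG1GC"

To confirm what a running TaskManager actually uses, jcmd <tm-pid> VM.flags lists the effective -XX flags, including the selected collector.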
Today I compared G1 against Copy + MarkSweepCompact.
Same runtime, same job. G1's cumulative GC time was higher, but spread over many more collections, each individually shorter: over 1h15min, G1 ran 1100+ GCs averaging about 1s each, while the latter ran 205 GCs averaging about 1.9s each.

On Tue, Mar 9, 2021 at 7:30 PM, yidan zhao <[hidden email]> wrote:
> [quoted earlier messages and the full stack trace trimmed]
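Averages like these can hide long tail pauses, and it is the longest single pause, relative to the 60s timeouts, that matters for leadership loss. A sketch for surfacing individual pause times on JDK 8 (the log path is a placeholder):

    # flink-conf.yaml: add JDK 8 GC logging to the same opts line
    env.java.opts: "-XX:-OmitStackTraceInFastThrow -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/tm-gc.log"

    # Or sample a live TaskManager once per second; FGCT is cumulative full-GC seconds
    jstat -gcutil <tm-pid> 1000

The GC log records the wall-clock duration of every pause, which answers whether any single pause ever approaches the timeout budgets: a 20s FGC is long, but still well inside a 60s session timeout unless several pile up.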
Hoping someone experienced can explain the differences between these parameters. If the network in the environment is poor, which parameter should be tuned, or which set of them? So far I have only raised ask.timeout.
Looking at the configuration, there are a lot of timeout-related parameters:
akka.ask.timeout
akka.lookup.timeout
akka.retry-gate-closed-for
akka.tcp.timeout
akka.startup-timeout
heartbeat.interval
heartbeat.timeout
high-availability.zookeeper.client.connection-timeout
high-availability.zookeeper.client.session-timeout
taskmanager.network.request-backoff.max
...

On Wed, Mar 10, 2021 at 1:13 PM, yidan zhao <[hidden email]> wrote:
> [quoted earlier messages and the full stack trace trimmed]
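These options fall into a few groups: akka.* govern internal RPC, heartbeat.* govern JM/TM liveness, high-availability.zookeeper.client.* govern the ZooKeeper session behind leader election, and taskmanager.network.request-backoff.max governs partition requests between TMs. For the "lost leadership" symptom specifically, the ZK session timeout and the heartbeats are the first candidates. An illustrative flink-conf.yaml sketch follows; the values are examples chosen to show the units and relationships, not recommendations:

    # RPC: how long an ask-style call waits for a reply
    akka.ask.timeout: 60s
    # ZK session behind leader election; losing this session is what
    # produces the "lost the leadership" errors (both values in ms)
    high-availability.zookeeper.client.session-timeout: 60000
    high-availability.zookeeper.client.connection-timeout: 15000
    # JM/TM liveness; the timeout should span several intervals (ms)
    heartbeat.interval: 10000
    heartbeat.timeout: 50000
    # Max backoff for TM-to-TM partition requests (ms)
    taskmanager.network.request-backoff.max: 60000

Whatever values are chosen, they only help if no single GC or network stall exceeds them, so it is worth confirming the worst-case pause from the GC logs before raising anything further.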