hi all
We run a batch job on Flink 1.10.1 every 10 minutes, but after about a day of runs it fails with the error below, and the failure is reproducible:

2020-08-15 19:32:59
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:498)
    at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:384)
    at sun.reflect.GeneratedMethodAccessor250.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:282)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown...

We analyzed a dump with MemoryAnalyzer: the dump file is 700 MB, but only a little over 70 MB shows up, and we could not pinpoint the cause.

Has anyone run into this kind of problem with batch jobs?
From your description, this looks like a class loading leak. In other words, for some reason the classes loaded by earlier jobs were never released, so class metadata keeps accumulating until the metaspace runs out.

The specific cause of the leak still has to be found by analyzing the dump. It is usually a third-party dependency used by the job, and in that case Flink has no way to forcibly unload the loaded classes.

Thank you~
Xintong Song

On Mon, Aug 17, 2020 at 6:38 PM 胡松 <[hidden email]> wrote:
> hi all
> We run a batch job on Flink 1.10.1 every 10 minutes, but after about a day of runs it fails with the error below, and the failure is reproducible
> [...]
@Xintong Song
We looked at the dump file. The driver is loaded by ParentFirstClassLoader and is held in DriverManager's registeredDrivers, so there is a strong reference to the ParentFirstClassLoader. Our MySQL initialization code is below. Do we really have to deregister the driver manually? If so, does anyone know how to do that, or is anyone using HikariDataSource without this problem? Note that this is a batch job that runs every 10 minutes.

    // Configure the HikariCP pool for MySQL; `config` is our own settings object.
    this.config = new HikariConfig();
    this.config.setDriverClassName("com.mysql.jdbc.Driver");
    this.config.setJdbcUrl(config.getConnectString());
    this.config.setUsername(config.getUsername());
    this.config.setPassword(config.getPassword());
    this.config.setMinimumIdle(config.getCpMinimumIdle());
    this.config.setMaximumPoolSize(config.getCpMaximumPoolSize());
    this.config.setIdleTimeout(config.getCpIdleTimeout());
    this.config.setMaxLifetime(config.getCpMaxLifetime());
    this.config.setAutoCommit(false);
    // Creating the data source starts the connection pool.
    this.source = new HikariDataSource(this.config);
    return true;

------------------ Original message ------------------
From: "user-zh" <[hidden email]>;
Sent: Tuesday, August 18, 2020, 5:04 PM
To: "user-zh" <[hidden email]>;
Subject: Re: flink 1.10.1 batch job OutOfMemoryError: Metaspace
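On the question of deregistering the driver by hand: one common approach is to remove, at job shutdown, every JDBC driver that was loaded by the job's own class loader, so that DriverManager no longer pins that loader. The following is only a minimal sketch under that assumption. The class JdbcDriverCleanup and its method name are hypothetical helpers, not part of Flink or HikariCP; only the java.sql.DriverManager calls are standard JDK API.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper (not a Flink or HikariCP class). */
public final class JdbcDriverCleanup {

    private JdbcDriverCleanup() {}

    /**
     * Deregisters every JDBC driver whose class was loaded by the given
     * class loader and returns the class names of the removed drivers.
     */
    public static List<String> deregisterDriversLoadedBy(ClassLoader loader)
            throws SQLException {
        List<String> removed = new ArrayList<>();
        // getDrivers() returns a snapshot, so deregistering while iterating is safe.
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            // Only touch drivers owned by this loader; leave shared drivers alone.
            if (driver.getClass().getClassLoader() == loader) {
                DriverManager.deregisterDriver(driver);
                removed.add(driver.getClass().getName());
            }
        }
        return removed;
    }
}
```

A possible usage, assuming the pool is created in a rich function or output format: in its close() method, first call source.close() on the HikariDataSource, then JdbcDriverCleanup.deregisterDriversLoadedBy(getClass().getClassLoader()). This only addresses the DriverManager reference; other strong references found in the dump would still need separate handling. Another option sometimes used is to place the JDBC driver jar in Flink's lib/ directory so that it is loaded once by the system class loader instead of per job.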
In reply to this post by 胡松
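Hello, I'm not sure whether your problem has been solved yet.
I ran into a similar problem when using Flink; it was mainly caused by the MySQL driver being loaded repeatedly.
Here is my solution. I'd be glad if it helps you:
Flink-MetaSpace OOM <https://www.yuque.com/codeleven/flink/dgygq2>

--
Sent from: http://apache-flink.147419.n8.nabble.com/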
@胡松
The images do not display. You may need to use a third-party image hosting service.

Thank you~
Xintong Song

On Thu, Aug 20, 2020 at 9:24 AM codeleven <[hidden email]> wrote:
> Hello, I'm not sure whether your problem has been solved yet.
> I ran into a similar problem when using Flink; it was mainly caused by the MySQL driver being loaded repeatedly.
> Here is my solution. I'd be glad if it helps you:
> Flink-MetaSpace OOM <https://www.yuque.com/codeleven/flink/dgygq2>
hi all
The link https://www.yuque.com/codeleven/flink/dgygq2 returns a 404.

We currently have two problems:
1. Submitting a job every 10 minutes fills up the JM's disk.
2. The Metaspace OOM. Metaspace grows by about 13 MB on every run. Analysis of the dump shows that after a run finishes there are still more than 100 ParentFirstClassLoader instances. They are held by strong references from the HDFS conf, the MySQL driver, and sun.security.jca.Providers, so the classes cannot be unloaded.

The community is also working on this problem at the moment:
https://issues.apache.org/jira/browse/FLINK-11205
https://issues.apache.org/jira/browse/FLINK-16225
https://issues.apache.org/jira/browse/FLINK-16245
These PRs only mitigate a few scenarios and do not solve the problem completely.

For now we plan to either restart the cluster periodically or run the batch jobs on Spark.

------------------ Original message ------------------
From: "user-zh" <[hidden email]>;
Sent: Thursday, August 20, 2020, 9:31 AM
To: "user-zh" <[hidden email]>;
Subject: Re: flink 1.10.1 batch job OutOfMemoryError: Metaspace
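To track the per-run Metaspace growth described above without taking a full dump each time, the JVM's standard management beans can be read from inside the process. This is a rough sketch, not something from the thread: MetaspaceProbe is a hypothetical class name, and the pool name "Metaspace" is an assumption that holds on HotSpot JVMs since Java 8 but may differ on other JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper for logging Metaspace and class-loading figures. */
public final class MetaspaceProbe {

    private MetaspaceProbe() {}

    /**
     * Returns the bytes currently used by the memory pool named "Metaspace",
     * or -1 if this JVM exposes no such pool (assumed name, HotSpot-specific).
     */
    public static long usedMetaspaceBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** Number of classes currently loaded in this JVM. */
    public static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Total number of classes unloaded since the JVM started. */
    public static long unloadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getUnloadedClassCount();
    }
}
```

Logging these three values at the start and end of each job run would show whether the loaded-class count keeps climbing while the unloaded count stays flat, which is the pattern a class loading leak is expected to produce.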
In reply to this post by codeleven
The article really is clear and easy to follow. I once ran into a similar problem with ES5, and after your analysis it all makes sense. Pretty good.

------------------ Original message ------------------
From: "user-zh" <[hidden email]>;
Sent: Wednesday, August 19, 2020, 10:48 PM
To: "user-zh" <[hidden email]>;
Subject: Re: flink 1.10.1 batch job OutOfMemoryError: Metaspace