Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

胡松
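hi all
   We run a batch job every 10 minutes on Flink 1.10.1, but after about a day of running, the following error starts occurring again and again: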
2020-08-15 19:32:59
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
        at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
        at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
        at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:498)
        at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:384)
        at sun.reflect.GeneratedMethodAccessor250.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:282)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
        at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
        at akka.actor.ActorCell.invoke(ActorCell.scala:561)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown...

We analyzed it with MemoryAnalyzer, but dumping 700 MB produced a file of only a little over 70 MB, and we could not locate the cause.


Has anyone run into this problem with batch jobs?
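A minimal sketch for checking, from inside the TaskManager JVM, whether Metaspace grows run after run. It assumes a HotSpot JVM, where Metaspace is exposed as a memory pool named "Metaspace"; `MetaspaceProbe` is a hypothetical helper, not a Flink API. If, across many runs, Metaspace usage and the loaded-class count keep rising while the unloaded-class count stays flat, that suggests a class-loading leak rather than a Metaspace that is simply too small for the job.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Hypothetical helper (not a Flink API): Metaspace usage and class-loading counters of this JVM.
final class MetaspaceProbe {
    private MetaspaceProbe() {}

    // Bytes currently used by the HotSpot "Metaspace" pool, or -1 if this JVM has no such pool.
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed();
            }
        }
        return -1L;
    }

    // One-line summary meant to be logged once per batch run and compared across runs.
    static String snapshot() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format("metaspaceUsedBytes=%d loadedClasses=%d unloadedClasses=%d",
                metaspaceUsedBytes(),
                classes.getLoadedClassCount(),
                classes.getUnloadedClassCount());
    }
}
```

Logging `MetaspaceProbe.snapshot()` once per run (for example from a rich function's `open()`) is enough to compare runs.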

Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

Xintong Song
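From your description, this looks like a class-loading leak: for some reason, the classes loaded by earlier jobs are never released, so class metadata keeps accumulating until metaspace runs out.
The specific cause of the leak still has to be found by analyzing the dump. It is usually a third-party dependency used by the job, and in that case Flink has no way to forcibly unload the loaded classes.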

Thank you~

Xintong Song
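Besides static registries, a dependency can also pin a job's class loader through a background thread it starts (a pool housekeeper, a driver cleanup thread, and so on): a live thread is a GC root, so its context class loader stays reachable. The hypothetical helper below lists such threads. It only compares context class loaders, so a thread whose `Runnable` class comes from the job jar but whose context loader differs would still have to be found in the heap dump.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic: names of live threads whose context class loader is the given loader.
final class LingeringThreads {
    private LingeringThreads() {}

    static List<String> threadsPinning(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        // getAllStackTraces() covers the threads that are currently running in this JVM.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // A live thread is a GC root, so its context class loader stays reachable.
            if (t.isAlive() && t.getContextClassLoader() == loader) {
                names.add(t.getName());
            }
        }
        return names;
    }
}
```

Called at the end of a run with the job's loader (in a Flink task thread, typically `Thread.currentThread().getContextClassLoader()`), every name it returns is a thread that keeps that run's classes alive.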



On Mon, Aug 17, 2020 at 6:38 PM 胡松 <[hidden email]> wrote:


Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

胡松
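@Xintong Song
Our dump shows that the MySQL driver was loaded by ParentFirstClassLoader and is stored in DriverManager.registeredDrivers, which holds a strong reference to the ParentFirstClassLoader. Our MySQL initialization code is below. Do we have to deregister the driver manually? Does anyone know how to do that, or has anyone used HikariDataSource without this problem? Note that ours is a batch job that runs every 10 minutes.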
                



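        import com.zaxxer.hikari.HikariConfig;
        import com.zaxxer.hikari.HikariDataSource;

        // ... inside our init method. The parameter "config" holds our own job settings;
        // this.config is the HikariCP configuration built from it.
        this.config = new HikariConfig();
        this.config.setDriverClassName("com.mysql.jdbc.Driver");
        this.config.setJdbcUrl(config.getConnectString());
        this.config.setUsername(config.getUsername());
        this.config.setPassword(config.getPassword());

        // Pool sizing and timeouts
        this.config.setMinimumIdle(config.getCpMinimumIdle());
        this.config.setMaximumPoolSize(config.getCpMaximumPoolSize());
        this.config.setIdleTimeout(config.getCpIdleTimeout());
        this.config.setMaxLifetime(config.getCpMaxLifetime());

        this.config.setAutoCommit(false);
        // Creating the pool loads the MySQL driver class, which registers itself with java.sql.DriverManager
        this.source = new HikariDataSource(this.config);

        return true;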
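On deregistering the driver manually: below is a minimal sketch, assuming it runs from a class inside the job jar once the job's work is done (for example from a rich function's `close()`; exactly where to call it is an assumption that depends on how subtasks share a TaskManager). `JdbcCleanup` is a hypothetical name. It closes the pool first (`HikariDataSource` implements `Closeable`, hence `AutoCloseable`) and then deregisters every driver whose class was defined by the job's loader. Where the caller lives matters, because `DriverManager.getDrivers()` and `deregisterDriver()` only act on drivers visible from the calling class's loader. MySQL Connector/J also starts an `AbandonedConnectionCleanupThread`; newer 5.1.x and 8.x releases offer `AbandonedConnectionCleanupThread.checkedShutdown()` to stop it, but verify that against the driver version in use.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical cleanup helper; call it once, from a class inside the job jar, when the job is done.
final class JdbcCleanup {
    private JdbcCleanup() {}

    // Closes the pool, then deregisters every JDBC driver whose class was defined by jobLoader.
    // Returns the class names of the deregistered drivers.
    static List<String> release(AutoCloseable pool, ClassLoader jobLoader) {
        if (pool != null) {
            try {
                pool.close(); // e.g. HikariDataSource.close(): stops pool threads, closes connections
            } catch (Exception e) {
                // Keep going: deregistering the driver still matters if closing the pool failed.
                System.err.println("Closing the connection pool failed: " + e);
            }
        }
        List<String> removed = new ArrayList<>();
        // Snapshot the registered drivers before deregistering any of them.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == jobLoader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed.add(driver.getClass().getName());
                } catch (SQLException e) {
                    System.err.println("Deregistering " + driver + " failed: " + e);
                }
            }
        }
        return removed;
    }
}
```

With the pool from the snippet above, the call could look like `JdbcCleanup.release(this.source, getClass().getClassLoader())`, assuming `this.source` is the `HikariDataSource` field.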

------------------ Original message ------------------
From: "user-zh" <[hidden email]>;
Sent: Tuesday, August 18, 2020 5:04 PM
To: "user-zh" <[hidden email]>;
Subject: Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

按你的描述,应该是存在类加载泄露的问题。也就是说,由于某些原因,导致之前作业加载的类,没能被释放掉,致使类元数据积累越来越多,metaspace
空间不足。
具体泄露的原因还是需要根据 dump 分析,通常是作业用到的第三方依赖导致的,这种情况 flink 是没法强行清除加载类的。

Thank you~

Xintong Song



On Mon, Aug 17, 2020 at 6:38 PM 胡松 <[hidden email]> wrote:

> hi all
> &nbsp; &nbsp;使用flink 1.10.1 每10分钟跑一个批任务,但是跑一天后重复复现报错
> 2020-08-15 19:32:59
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
>         at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
>         at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
>         at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:498)
>         at
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:384)
>         at sun.reflect.GeneratedMethodAccessor250.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:282)
>         at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>         at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>         at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>         at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>         at akka.japi.pf
> .UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>         at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>         at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>         at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>         at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>         at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.OutOfMemoryError: Metaspace. The metaspace
> out-of-memory error has occurred. This can mean two things: either the job
> requires a larger size of JVM metaspace to load classes or there is a class
> loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size'
> configuration option should be increased. If the error persists (usually in
> cluster after several job (re-)submissions) then there is probably a class
> loading leak which has to be investigated and fixed. The task executor has
> to be shutdown...
>
> 使用MemoryAnalyzer分析,dump 700m 的文件就70多m,没定位到原因。
>
>
> 请问各位有碰到批任务这种问题的么?

Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

codeleven
In reply to this post by 胡松
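Hi, I'm not sure whether you've solved your problem yet.
I ran into a similar problem with Flink, caused mainly by the MySQL driver being loaded over and over again.
Here is my write-up of the solution; I'd be glad if it helps:
Flink-MetaSpace OOM <https://www.yuque.com/codeleven/flink/dgygq2>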



--
Sent from: http://apache-flink.147419.n8.nabble.com/
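One mitigation commonly suggested for JDBC drivers (the Flink documentation on debugging classloading suggests it; the linked article may take a different route) is to put the driver jar into Flink's `lib/` directory instead of the job jar, so the driver is loaded once by the system class loader rather than once per job. To confirm where a class actually came from, a small hypothetical helper can print its defining loader and the parents above it:

```java
// Hypothetical diagnostic: shows which class loader defined a class, plus the parent chain above it.
final class LoaderChain {
    private LoaderChain() {}

    // Format: "<class> loaded by <defining loader> -> <its parent> -> ... -> bootstrap".
    // A null loader stands for the bootstrap class loader (JDK core classes).
    static String describe(Class<?> c) {
        StringBuilder sb = new StringBuilder(c.getName()).append(" loaded by ");
        for (ClassLoader l = c.getClassLoader(); l != null; l = l.getParent()) {
            sb.append(l.getClass().getName()).append(" -> ");
        }
        return sb.append("bootstrap").toString();
    }
}
```

Logging `LoaderChain.describe(Class.forName("com.mysql.jdbc.Driver"))` from inside the job would be expected to show a Flink user-code loader (such as the ParentFirstClassLoader seen in the dump) first while the driver is bundled in the job jar, and only the system loader chain once the jar sits in `lib/`.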

Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

Xintong Song
@胡松
Your images don't show up; you may need to use a third-party image hosting service.

Thank you~

Xintong Song



On Thu, Aug 20, 2020 at 9:24 AM codeleven <[hidden email]> wrote:


Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

void
hi all
    The link https://www.yuque.com/codeleven/flink/dgygq2 returns a 404.
    We currently have two problems:
        1. Submitting a job every 10 minutes fills up the JobManager's disk.
        2. Metaspace OOM: each run grows Metaspace by about 13 MB. Analyzing the dump shows that after the runs finish, more than 100 ParentFirstClassLoader instances are still alive, held by strong references from the HDFS conf, the MySQL driver, and sun.security.jca.Providers, so their classes cannot be unloaded.
        This is also a problem the community is still working on:

            https://issues.apache.org/jira/browse/FLINK-11205

            https://issues.apache.org/jira/browse/FLINK-16225

            https://issues.apache.org/jira/browse/FLINK-16245

        These PRs only mitigate a few scenarios; they do not solve the problem completely.

    For now we plan to restart the cluster periodically, or run the batch jobs on Spark instead.
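For the sun.security.jca.Providers reference: if it comes from a JCA provider that the job, or a library bundled in the job jar, installed with `Security.addProvider`, that provider has to be removed again before the job's loader can be collected. A minimal sketch under that assumption (`ProviderCleanup` is a hypothetical name); references held by other JDK internals are not covered by it.

```java
import java.security.Provider;
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

// Hypothetical cleanup: remove JCA providers whose class was defined by the job's class loader.
final class ProviderCleanup {
    private ProviderCleanup() {}

    // Returns the names of the providers that were removed.
    static List<String> removeProvidersLoadedBy(ClassLoader jobLoader) {
        List<String> removed = new ArrayList<>();
        // Security.getProviders() returns a fresh array, so removing while looping over it is safe.
        for (Provider p : Security.getProviders()) {
            if (p.getClass().getClassLoader() == jobLoader) {
                Security.removeProvider(p.getName());
                removed.add(p.getName());
            }
        }
        return removed;
    }
}
```

Providers shipped with the JDK are defined by the bootstrap or platform loader, so they never match a job's loader and are left in place.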
------------------ Original message ------------------
From: "user-zh" <[hidden email]>;
Sent: Thursday, August 20, 2020 9:31 AM
To: "user-zh" <[hidden email]>;
Subject: Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

kcz
In reply to this post by codeleven
Your article really is clear and easy to follow. I once ran into a similar problem with ES5, and your walkthrough explains it well. Pretty good.




------------------ Original message ------------------
From: "user-zh" <[hidden email]>;
Sent: Wednesday, August 19, 2020 10:48 PM
To: "user-zh" <[hidden email]>;
Subject: Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

Re: Re: Flink 1.10.1 batch job fails with OutOfMemoryError: Metaspace

wulishan
Can you open that page? I can't.