相同的作业配置 ,Flink1.12 版本的作业checkpoint耗时增加以及制作失败,Flink1.9的作业运行正常

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

相同的作业配置 ,Flink1.12 版本的作业checkpoint耗时增加以及制作失败,Flink1.9的作业运行正常

Haihang Jing
【现象】相同配置的作业(checkpoint interval :3分钟,作业逻辑:regular
join),flink1.9运行正常,flink1.12运行一段时间后,checkpoint制作耗时增大,最后checkpoint制作失败。

【分析】了解到flink1.10后对于checkpoint机制进行调整,接收端在barrier对齐时不会缓存单个barrier到达后的数据,意味着发送方必须在barrier对齐后等待credit
feedback来传输数据,因此发送方会产生一定的冷启动,影响到延迟和网络吞吐量,因此调整checkpoint
interval为10分钟进行对比测试,发现调整后(interval为10),flink1.12上运行的作业运行正常。
相关issue:https://issues.apache.org/jira/browse/FLINK-16404

【问题】1.想咨询下大家有遇到过相同的情况么?
            2.flink1.12的作业checkpoint间隔对作业的影响具体有多大?官方有测试么?

checkpoint interval为3分钟的flink1.12作业运行5小时后,checkpoint制作失败,具体异常栈:

org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
failure threshold.

        at
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:96)

        at
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65)

        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1924)

        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1897)

        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93)

        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2038)

        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)

        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)

        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)
Reply | Threaded
Open this post in threaded view
|

Re: 相同的作业配置 ,Flink1.12 版本的作业checkpoint耗时增加以及制作失败,Flink1.9的作业运行正常

Yingjie Cao
这个应该不是FLINK-16404
<https://issues.apache.org/jira/browse/FLINK-16404>的影响,那个对checkpoint时间的影响比较小,是已经有一个benchmark测试的,1s的checkpoint
interval也没什么大问题,我建议可以看一下失败的task的stack,看一下在干什么,可能排查问题更快一些。

Haihang Jing <[hidden email]> 于2021年3月24日周三 下午12:06写道:

> 【现象】相同配置的作业(checkpoint interval :3分钟,作业逻辑:regular
> join),flink1.9运行正常,flink1.12运行一段时间后,checkpoint制作耗时增大,最后checkpoint制作失败。
>
>
> 【分析】了解到flink1.10后对于checkpoint机制进行调整,接收端在barrier对齐时不会缓存单个barrier到达后的数据,意味着发送方必须在barrier对齐后等待credit
> feedback来传输数据,因此发送方会产生一定的冷启动,影响到延迟和网络吞吐量,因此调整checkpoint
> interval为10分钟进行对比测试,发现调整后(interval为10),flink1.12上运行的作业运行正常。
> 相关issue:https://issues.apache.org/jira/browse/FLINK-16404
>
> 【问题】1.想咨询下大家有遇到过相同的情况么?
>             2.flink1.12的作业checkpoint间隔对作业的影响具体有多大?官方有测试么?
>
> checkpoint interval为3分钟的flink1.12作业运行5小时后,checkpoint制作失败,具体异常栈:
>
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
> failure threshold.
>
>         at
>
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:96)
>
>         at
>
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65)
>
>         at
>
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1924)
>
>         at
>
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1897)
>
>         at
>
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93)
>
>         at
>
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2038)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>
>         at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>
>         at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>
>         at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
>         at java.lang.Thread.run(Thread.java:745)
>