Flink 1.12.0: checkpoints fail every few hours


Flink 1.12.0: checkpoints fail every few hours

Frost Wong
Hi everyone,

I'm running a job on Flink on YARN, and every few hours the following error shows up:

2021-03-18 08:52:37,019 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 661818 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (562357 bytes in 4699 ms).
2021-03-18 08:52:37,637 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 661819 (type=CHECKPOINT) @ 1616028757520 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 08:52:42,956 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 661819 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (2233389 bytes in 4939 ms).
2021-03-18 08:52:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 09:12:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.
2021-03-18 09:12:43,615 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_231]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_231]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_231]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_231]
2021-03-18 09:12:43,618 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job csmonitor_comment_strategy (4fa72fc414f53e5ee062f9fbd5a2f4d5) switched from state RUNNING to RESTARTING.
2021-03-18 09:12:43,619 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map (43/256) (18dec1f23b95f741f5266594621971d5) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map (44/256) (3f2ec60b2f3042ceea6e1d660c78d3d7) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map (45/256) (66d411c2266ab025b69196dfec30d888) switched from RUNNING to CANCELING.
After that, the job recovers on its own. I'm using unaligned checkpoints with the RocksDB state backend, and there are no other error messages before or after this failure. From the checkpoint metrics, it is always the last remaining subtask that never completes; adjusting the parallelism didn't fix it either.
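
For reference, the setup described above corresponds roughly to this sketch (the checkpoint path and interval are placeholders; the 20-minute timeout is inferred from the log, where checkpoint 661820 was triggered at 08:52:43 and expired at 09:12:43):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints; the URI is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Interval is a placeholder; the thread does not state the real value.
        env.enableCheckpointing(60_000L);

        // Unaligned checkpoints, as described above.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // 20 minutes, inferred from the trigger/expiry timestamps in the log.
        env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000L);

        // ... pipeline definition and env.execute(...) omitted.
    }
}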

Thanks, everyone!

Re: Flink 1.12.0: checkpoints fail every few hours

nobleyd
Configure it so that checkpoint failures don't affect the job; in your case it looks like the failure even caused a job restart?


Re: Flink 1.12.0: checkpoints fail every few hours

Frost Wong
Oh, I see there is a

setTolerableCheckpointFailureNumber

method. I didn't know about it before; it's worth a try. What I still don't understand is why the checkpoints fail in the first place, given there is no error message at all.
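
Something like the following, presumably (a minimal sketch; the interval is a placeholder and the threshold value 3 is arbitrary):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerableFailuresSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // placeholder interval

        // Allow up to 3 checkpoint failures (an expired checkpoint counts as a
        // failure) before the job fails with "Exceeded checkpoint tolerable
        // failure threshold". The default tolerance is 0, which matches the
        // restart after the single expired checkpoint in the log above.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);

        // ... pipeline definition and env.execute(...) omitted.
    }
}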
________________________________
From: yidan zhao <[hidden email]>
Sent: March 18, 2021, 3:47
To: user-zh <[hidden email]>
Subject: Re: Flink 1.12.0: checkpoints fail every few hours

Configure it so that checkpoint failures don't affect the job; in your case it looks like the failure even caused a job restart?


Unsubscribe

΢Ұ
In reply to this post by nobleyd

Re: Flink 1.12.0: checkpoints fail every few hours

Congxian Qiu
In reply to this post by Frost Wong
From the logs, the checkpoint timed out. Try to find out which parallel subtask of which operator failed to finish its checkpoint; this article [1] may help.

[1] https://www.infoq.cn/article/g8ylv3i2akmmzgccz8ku
Best,
Congxian



Re: Flink 1.12.0: checkpoints fail every few hours

Haihang Jing
In reply to this post by Frost Wong
Hi, have you tracked down the problem yet?
I've run into the same issue, and it seems related to the checkpoint interval.
I have two identical jobs (checkpoint interval set to 3 minutes), one on Flink 1.9 and one on Flink 1.12. The 1.9 job runs stably, while the 1.12 job fails to complete checkpoints after about 5 hours of running, throwing org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
After I increased the checkpoint interval to 10 minutes, the 1.12 job also ran stably, so I suspect the interval is the factor.
I've seen an issue explaining that the checkpoint mechanism was adjusted after Flink 1.10: during barrier alignment the receiver no longer buffers data that arrives after an individual barrier, which means the sender has to wait for credit feedback after alignment before transmitting data. The sender therefore incurs a certain cold-start cost, affecting latency and network throughput. But I'm not sure this is really the cause, or how to pinpoint its impact.
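
In code terms, the change between the failing and the stable run was essentially just the interval (a sketch; the min-pause setting at the end is my own untested guess at a mitigation):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Failing run on 1.12: checkpoint every 3 minutes.
        // env.enableCheckpointing(3 * 60 * 1000L);

        // Stable run: checkpoint every 10 minutes.
        env.enableCheckpointing(10 * 60 * 1000L);

        // Untested guess: a minimum pause between checkpoints may also keep a
        // slow checkpoint from immediately backing up the next one.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60_000L);

        // ... pipeline definition and env.execute(...) omitted.
    }
}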




Re: Flink 1.12.0: checkpoints fail every few hours

张锴
Hi, I've run into this problem too. How are your checkpoints configured? Could you share that for reference?


Re: Flink 1.12.0: checkpoints fail every few hours

tianxy
Ugh, this problem is a real headache; I still haven't found the cause. If you figure it out on your end, let me know 😊


田向阳
Email: [hidden email]
