Hi everyone,
I'm running a job with Flink on YARN, and every few hours it hits the following error:

2021-03-18 08:52:37,019 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661818 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (562357 bytes in 4699 ms).
2021-03-18 08:52:37,637 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661819 (type=CHECKPOINT) @ 1616028757520 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 08:52:42,956 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 661819 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (2233389 bytes in 4939 ms).
2021-03-18 08:52:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5.
2021-03-18 09:12:43,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before completing.
2021-03-18 09:12:43,615 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
	at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) ~[flink-dist_2.12-1.12.0.jar:1.12.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_231]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_231]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_231]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_231]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_231]
2021-03-18 09:12:43,618 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job csmonitor_comment_strategy (4fa72fc414f53e5ee062f9fbd5a2f4d5) switched from state RUNNING to RESTARTING.
2021-03-18 09:12:43,619 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (43/256) (18dec1f23b95f741f5266594621971d5) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (44/256) (3f2ec60b2f3042ceea6e1d660c78d3d7) switched from RUNNING to CANCELING.
2021-03-18 09:12:43,622 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map (45/256) (66d411c2266ab025b69196dfec30d888) switched from RUNNING to CANCELING.

The job then recovered on its own. I'm using unaligned checkpoints with the RocksDB state backend, and there are no other error messages before or after this failure. From the checkpoint metrics, it is always the last subtask that never completes; adjusting the parallelism didn't solve it either.

Thanks, everyone!
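For reference, a setup like the one described (unaligned checkpoints with the RocksDB state backend on Flink 1.12) might look roughly like this in flink-conf.yaml — the path and interval here are illustrative, not the poster's actual settings; only the 20-minute timeout is taken from the log (triggered 08:52:43, expired 09:12:43):

```yaml
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints   # illustrative path
execution.checkpointing.interval: 5s               # illustrative value
execution.checkpointing.timeout: 20min             # matches the gap before "expired before completing"
execution.checkpointing.unaligned: true
```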
You should configure checkpoint failures so they don't affect the job — in your case it looks like they even caused a restart?
Oh, I see there is a

setTolerableCheckpointFailureNumber

method. I didn't know about it before; it's worth a try. What I still don't understand, though, is why the checkpoints fail at all — there are no error messages whatsoever.
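The method mentioned above lives on `CheckpointConfig`; the same knob can also be set in flink-conf.yaml. A minimal sketch — the value 3 is illustrative, and (to my understanding) the Flink 1.12 default is 0, which would explain why a single expired checkpoint was enough to trip "Exceeded checkpoint tolerable failure threshold" and restart the job:

```yaml
# Tolerate a few consecutive checkpoint failures before failing/restarting the job.
# Programmatic equivalent: env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3)
execution.checkpointing.tolerable-failed-checkpoints: 3
```

Note this only masks the restart; the checkpoints themselves would still be expiring.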
In reply to this post by Frost Wong
From the logs, the checkpoint expired (timed out). Try to find out which parallel subtask of which operator failed to finish its checkpoint — this article [1] may help.

[1] https://www.infoq.cn/article/g8ylv3i2akmmzgccz8ku

Best,
Congxian
In reply to this post by Frost Wong
Hi, have you located the problem yet?
I ran into the same issue, and I suspect it is related to the checkpoint interval. I have two identical jobs (checkpoint interval set to 3 minutes), one on Flink 1.9 and one on Flink 1.12. The 1.9 job runs stably, while the 1.12 job fails to complete checkpoints after about 5 hours and throws org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. After I increased the checkpoint interval to 10 minutes, the 1.12 job also ran stably, so I suspect the interval matters.

I have also seen an issue noting that the checkpoint mechanism was adjusted after Flink 1.10: during barrier alignment the receiver no longer buffers data that arrives after an individual barrier, which means the sender must wait for credit feedback after alignment before transferring data. This gives the sender a certain cold-start cost, affecting latency and network throughput. But I'm not sure this is really the cause, or how to pinpoint its impact.

--
Sent from: http://apache-flink.147419.n8.nabble.com/
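As a rough illustration of the interval/timeout relationship described above — the 10-minute interval is the value that reportedly stabilized the 1.12 job, while the timeout and min-pause values are illustrative assumptions:

```yaml
execution.checkpointing.interval: 10min   # enlarged from 3min, which stabilized the 1.12 job
execution.checkpointing.timeout: 20min    # illustrative; give slow checkpoints room to finish
execution.checkpointing.min-pause: 1min   # illustrative; guarantees breathing room between checkpoints
```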
Hi, I've hit this problem as well. How is your checkpointing configured? Could you share it for reference?
Sigh, this problem is a real headache — I still haven't found the cause. Please let me know if you figure it out 😊
田向阳
Email: [hidden email]