Apache Flink Chinese User Mailing List

Re: Checkpoint failures in a stream processing job


Robert.Zhang

Re: Checkpoint failures in a stream processing job

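I checked the logs: the failures are caused by some checkpoints timing out before they complete. In the web UI, the checkpoint of the iteration source never completes.
The official documentation gives no further detail on checkpointing in iterative streams. I can understand that data inside the loop may be lost, but I don't quite understand why the checkpoint cannot succeed at all.
As I understand the Chandy-Lamport algorithm, the upstream operator's barrier should be passed directly to the downstream operator, so there should be no case where a barrier is never received. I don't know what is causing this.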

---Original message---
From: "Congxian Qiu"<[hidden email]>
Sent: Monday, August 24, 2020, 8:21 PM
To: "user-zh"<[hidden email]>;
Subject: Re: Checkpoint failures in a stream processing job

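Hi
   Judging from the error "Exceeded checkpoint tolerable failure threshold", your checkpoints have been failing repeatedly, which caused the job to fail. You need to find out why the checkpoints are failing; this article [1] may be of some help.
   Also, from your configuration, you have enabled unaligned checkpoints, which the article above does not cover yet.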

[1] https://zhuanlan.zhihu.com/p/87131964
Best,
Congxian

Robert.Zhang <[hidden email]> wrote on Friday, August 21, 2020 at 6:31 PM:

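> Hello all,
> I have run into a problem: I am using checkpoints in an iterative stream job,
> configured according to the documentation, and during testing checkpoints almost never succeed.
> The test state is very small, only a few KB, yet checkpoints still fail. The job reports
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.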
>
>
> The configuration is as follows:
> env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE, true);
> CheckpointConfig checkpointConfig = env.getCheckpointConfig();
> checkpointConfig.setCheckpointTimeout(600000);
> checkpointConfig.setMinPauseBetweenCheckpoints(60000);
> checkpointConfig.setMaxConcurrentCheckpoints(4);
>
> checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
> checkpointConfig.setPreferCheckpointForRecovery(true);
> checkpointConfig.setTolerableCheckpointFailureNumber(2);
> checkpointConfig.enableUnalignedCheckpoints();
>
>
> The job only processes a few records, and there is no backpressure. Has anyone run into a similar problem?
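
To illustrate the error discussed above, here is a minimal sketch of how a tolerable-failure counter can turn repeated checkpoint timeouts into a job failure. It is not Flink's implementation: the class and method names are hypothetical, and it assumes that the counter tracks consecutive failures and is reset by a successful checkpoint. Under that assumption, a threshold of 2 (as in setTolerableCheckpointFailureNumber(2)) fails the job on the third failure in a row.

```java
/** Hypothetical, simplified model of a tolerable-checkpoint-failure counter. */
public class FailureThresholdSketch {
    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    public FailureThresholdSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records one checkpoint outcome.
     * Returns true if the number of consecutive failures now exceeds the threshold.
     */
    public boolean record(boolean checkpointSucceeded) {
        if (checkpointSucceeded) {
            // Assumption: a completed checkpoint resets the counter.
            consecutiveFailures = 0;
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    public static void main(String[] args) {
        FailureThresholdSketch tracker = new FailureThresholdSketch(2);
        // Three timeouts in a row, as when one task never finishes its snapshot.
        boolean[] outcomes = {false, false, false};
        for (int i = 0; i < outcomes.length; i++) {
            boolean jobFails = tracker.record(outcomes[i]);
            System.out.println("checkpoint " + (i + 1) + " failed, job fails: " + jobFails);
        }
    }
}
```

In this model the threshold error is only a symptom: it says that checkpoints kept failing, not why, which is why the reply points at the underlying timeouts.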

Congxian Qiu

Re: Checkpoint failures in a stream processing job

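Hi
When a checkpoint fails by timing out, you need to look into the specific cause. If it is the source that did not complete, try checking the CPU usage of the affected subtasks (the source subtasks whose snapshot did not complete) and whether their logic is stuck somewhere; that may give you some clues. A source triggers its snapshot when it receives the RPC from the JM, so unlike other operators, barrier alignment does not need to be considered here.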
Best,
Congxian
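
The distinction drawn above between sources and other operators can be sketched with a toy model. This is only an illustration of aligned checkpointing in general, not Flink code: the class and method names are hypothetical, and it ignores unaligned checkpoints and iteration feedback edges. A task with no input channels can snapshot as soon as the trigger arrives, while a task with inputs must first see a barrier on every channel.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical toy model of when a task may take its snapshot under aligned checkpoints. */
public class BarrierAlignmentSketch {
    private final int inputChannels;
    private final Set<Integer> channelsWithBarrier = new HashSet<>();

    public BarrierAlignmentSketch(int inputChannels) {
        this.inputChannels = inputChannels;
    }

    /**
     * Source path: the trigger arrives out of band (for example via RPC).
     * Returns true if the task can snapshot immediately, i.e. it has no inputs to align.
     */
    public boolean onTrigger() {
        return inputChannels == 0;
    }

    /**
     * Non-source path: records a barrier seen on one input channel.
     * Returns true once a barrier has been seen on every input channel.
     */
    public boolean onBarrier(int channel) {
        if (channel < 0 || channel >= inputChannels) {
            throw new IllegalArgumentException("unknown channel " + channel);
        }
        channelsWithBarrier.add(channel);
        return channelsWithBarrier.size() == inputChannels;
    }

    public static void main(String[] args) {
        BarrierAlignmentSketch source = new BarrierAlignmentSketch(0);
        System.out.println("source can snapshot on trigger: " + source.onTrigger());

        BarrierAlignmentSketch join = new BarrierAlignmentSketch(2);
        System.out.println("after barrier on channel 0: " + join.onBarrier(0));
        System.out.println("after barrier on channel 1: " + join.onBarrier(1));
    }
}
```

If a source-like task still fails to complete in such a model, waiting for barriers cannot be the reason, which is why the suggestion is to check whether the task itself is stuck or busy.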

Robert.Zhang <[hidden email]> wrote on Tuesday, August 25, 2020 at 12:58 AM:

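> I checked the logs: the failures are caused by some checkpoints timing out before they complete. In the web UI, the checkpoint of the iteration source never completes.
> The official documentation gives no further detail on checkpointing in iterative streams. I can understand that data inside the loop may be lost, but I don't quite understand why the checkpoint cannot succeed at all.
> As I understand the Chandy-Lamport algorithm, the upstream operator's barrier should be passed directly to the downstream operator, so there should be no case where a barrier is never received. I don't know what is causing this.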