flink 1.9.2 upgraded to 1.10.0: failed jobs cannot recover from checkpoints

Re: flink 1.9.2 upgraded to 1.10.0: failed jobs cannot recover from checkpoints

Congxian Qiu
Hi Peihui,
   Not sure what happened; I still don't see the attachment on my side. I'll message you directly for the detailed log and take a look.

Best,
Congxian


Peihui He <[hidden email]> wrote on Thu, Jul 23, 2020 at 8:57 AM:

> Hi Congxian,
>
> Is there any conclusion on this issue yet?
>
> Best wishes.
>
> Peihui He <[hidden email]> wrote on Fri, Jul 17, 2020 at 4:21 PM:
>
>> Hi Congxian,
>>
>> [image: Snipaste_2020-07-17_16-20-06.png]
>>
>> In Chrome I can see that it was uploaded, and it can be downloaded.
>>
>> Best wishes.
>>
>> Congxian Qiu <[hidden email]> wrote on Fri, Jul 17, 2020 at 1:31 PM:
>>
>>> Hi Peihui,
>>>
>>> Thanks for your reply. I don't see the attachment on my side; could you confirm it went through?
>>>
>>> Best,
>>> Congxian
>>>
>>>
>>> Peihui He <[hidden email]> wrote on Fri, Jul 17, 2020 at 10:13 AM:
>>>
>>> > Hi Congxian
>>> >
>>> > See the attachment.
>>> >
>>> > Best wishes.
>>> >
>>> > Congxian Qiu <[hidden email]> wrote on Thu, Jul 16, 2020 at 8:24 PM:
>>> >
>>> >> Hi Peihui
>>> >>
>>> >> Thanks for your reply. Could you reproduce it once on 1.10.0 and share the
>>> >> relevant logs (JM log and TM log; if convenient, please enable debug logging
>>> >> too)? If the logs are large, you can paste them to a gist [1] and reply to
>>> >> the list with the link. Many thanks!
>>> >>
>>> >> [1] https://gist.github.com/
>>> >>
>>> >> Best,
>>> >> Congxian
>>> >>
>>> >>
>>> >> Peihui He <[hidden email]> wrote on Thu, Jul 16, 2020 at 5:54 PM:
>>> >>
>>> >> > Hi Yun,
>>> >> >
>>> >> > My test needs to run on the cluster; running locally in IDEA works fine.
>>> >> > The flink-conf.yaml for Flink 1.10.1 was copied from 1.10.0, yet it is
>>> >> > 1.10.0 that fails.
>>> >> >
>>> >> > The attachment is the source of the job. To run it you need to change the
>>> >> > socket host; typing "hepeihui" into the socket throws the exception.
>>> >> >
>>> >> > Peihui He <[hidden email]> wrote on Thu, Jul 16, 2020 at 5:26 PM:
>>> >> >
>>> >> >> Hi Yun,
>>> >> >>
>>> >> >> The job does not enable local recovery, and in my tests on 1.10.0 it reproduces every time.
>>> >> >>
>>> >> >> Best wishes.
>>> >> >>
>>> >> >> Yun Tang <[hidden email]> wrote on Thu, Jul 16, 2020 at 5:04 PM:
>>> >> >>
>>> >> >>> Hi Peihui
>>> >> >>>
>>> >> >>> The only related code change in Flink 1.10.1 is the class used for the
>>> >> >>> path during restore [1], but your operating system is not Windows, so in
>>> >> >>> principle that should be unrelated.
>>> >> >>> Also, is this problem reproducible every time you hit a failover? Judging
>>> >> >>> from the file path, the job has not enabled local recovery either, right?
>>> >> >>>
>>> >> >>>
>>> >> >>> [1]
>>> >> >>> https://github.com/apache/flink/commit/399329275e5e2baca9ed9494cce97ff732ac077a
>>> >> >>> Best,
>>> >> >>> Yun Tang
>>> >> >>> ________________________________
>>> >> >>> From: Peihui He <[hidden email]>
>>> >> >>> Sent: Thursday, July 16, 2020 16:15
>>> >> >>> To: [hidden email] <[hidden email]>
>>> >> >>> Subject: Re: flink 1.9.2 upgraded to 1.10.0: failed jobs cannot recover from checkpoints
>>> >> >>>
>>> >> >>> Hi Yun,
>>> >> >>>
>>> >> >>> Sorry for the late reply. It is case 2 as described by @Congxian. The
>>> >> >>> exception is a RuntimeException thrown when a specific word arrives on the
>>> >> >>> socket, which fails the task; the job then tries to recover from the
>>> >> >>> checkpoint, but during recovery it reports
>>> >> >>>
>>> >> >>> Caused by: java.nio.file.NoSuchFileException:
>>> >> >>> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/db/000009.sst
>>> >> >>> ->
>>> >> >>> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/8f609663-4fbb-483f-83c0-de04654310f7/000009.sst
>>> >> >>>
>>> >> >>> The situation is similar to @chenxyz's:
>>> >> >>>
>>> >> >>>
>>> >> >>> http://apache-flink.147419.n8.nabble.com/rocksdb-Could-not-restore-keyed-state-backend-for-KeyedProcessOperator-td2232.html
>>> >> >>>
>>> >> >>> Switching to 1.10.1 fixes it.
>>> >> >>>
>>> >> >>> Best wishes.
>>> >> >>>
>>> >> >>> Yun Tang <[hidden email]> wrote on Wed, Jul 15, 2020 at 4:35 PM:
>>> >> >>>
>>> >> >>> > Hi Robin,
>>> >> >>> >
>>> >> >>> > That statement is not entirely accurate. The community explicitly
>>> >> >>> > guarantees savepoint compatibility [1], but that does not mean recovery
>>> >> >>> > from a checkpoint across major versions is impossible. The community
>>> >> >>> > makes no promise mainly because maintaining it would cost too much
>>> >> >>> > effort; from the code's perspective, as long as state schema evolution
>>> >> >>> > [2] is used properly, cross-version checkpoint recovery is basically
>>> >> >>> > compatible today.
>>> >> >>> >
>>> >> >>> > Also, @Peihui, please describe your exception more clearly. My first
>>> >> >>> > reply already guessed that this exception is not the root cause; please
>>> >> >>> > look in the logs for the root cause of the failed recovery, and if you
>>> >> >>> > don't know how to find it there, share the relevant logs.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > [1]
>>> >> >>> > https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#compatibility-table
>>> >> >>> > [2]
>>> >> >>> > https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/schema_evolution.html
>>> >> >>> >
>>> >> >>> > Best,
>>> >> >>> > Yun Tang
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > ________________________________
>>> >> >>> > From: Robin Zhang <[hidden email]>
>>> >> >>> > Sent: Wednesday, July 15, 2020 16:23
>>> >> >>> > To: [hidden email] <[hidden email]>
>>> >> >>> > Subject: Re: flink 1.9.2 upgraded to 1.10.0: failed jobs cannot recover from checkpoints
>>> >> >>> >
>>> >> >>> > As far as I know, you cannot restore directly from a checkpoint across major versions; you can only discard the state and rerun.
>>> >> >>> >
>>> >> >>> > Best
>>> >> >>> > Robin Zhang
>>> >> >>> > ________________________________
>>> >> >>> > From: Peihui He <[hidden email]>
>>> >> >>> > Sent: Tuesday, July 14, 2020 10:42
>>> >> >>> > To: [hidden email] <[hidden email]>
>>> >> >>> > Subject: flink 1.9.2 upgraded to 1.10.0: failed jobs cannot recover from checkpoints
>>> >> >>> >
>>> >> >>> > Hello,
>>> >> >>> >
>>> >> >>> >         After upgrading to 1.10.0, when the program fails it tries to
>>> >> >>> > recover from the checkpoint, but it always fails with
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > Caused by: java.nio.file.NoSuchFileException:
>>> >> >>> > /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/db/000009.sst
>>> >> >>> > ->
>>> >> >>> > /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/8f609663-4fbb-483f-83c0-de04654310f7/000009.sst
>>> >> >>> >
>>> >> >>> > The configuration is the same as on 1.9.2:
>>> >> >>> > state.backend: rocksdb
>>> >> >>> > state.checkpoints.dir: hdfs:///flink/checkpoints/wc/
>>> >> >>> > state.savepoints.dir: hdfs:///flink/savepoints/wc/
>>> >> >>> > state.backend.incremental: true
>>> >> >>> >
>>> >> >>> > The code also has:
>>> >> >>> >
>>> >> >>> > env.enableCheckpointing(10000);
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > env.getCheckpointConfig().enableExternalizedCheckpoints(
>>> >> >>> >     CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
>>> >> >>> > env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,
>>> >> >>> >     org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS)));
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >           Does 1.10.0 require any special configuration?
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > --
>>> >> >>> > Sent from: http://apache-flink.147419.n8.nabble.com/
>>> >> >>> >
>>> >> >>>
>>> >> >>
>>> >>
>>> >
>>>
>>
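The NoSuchFileException discussed above is raised while a recovered .sst file is hard-linked into the RocksDB db/ directory during incremental restore; the link fails when its source path (the right-hand side of the arrow in the message) no longer exists. A minimal sketch of that failure mode using only the JDK (class and file names here are illustrative, not Flink code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class HardLinkRestoreSketch {

    // Tries to hard-link a db file to a source .sst that was never written,
    // mirroring the "db/000009.sst -> <restore-dir>/000009.sst" failure.
    // Returns the simple name of the exception thrown by the link attempt.
    static String linkFromMissingSource() {
        try {
            Path dir = Files.createTempDirectory("restore-sketch");
            Path missingSst = dir.resolve("000009.sst"); // source: never created
            Path dbFile = dir.resolve("db-000009.sst");  // link to be created
            Files.createLink(dbFile, missingSst);        // link(2) under the hood
            return "no exception";
        } catch (NoSuchFileException e) {
            return e.getClass().getSimpleName();
        } catch (IOException e) {
            return "unexpected: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(linkFromMissingSource()); // NoSuchFileException
    }
}
```

Running this prints the same exception type as in the logs above; in the real job the missing source is a file under the per-restore UUID directory.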

Re: flink 1.9.2 upgraded to 1.10.0: failed jobs cannot recover from checkpoints

LiangbinZhang
In reply to this post by Yun Tang
Hi Tang,

Sorry, I misunderstood earlier; thanks for the correction.

Best,
Robin Zhang
--
Sent from: http://apache-flink.147419.n8.nabble.com/