Hi Peihui
I'm not sure why, but I still don't see the attachment on my side. I'll message you privately for the specific logs and take a look.

Best,
Congxian

Peihui He <[hidden email]> wrote on Thu, Jul 23, 2020 at 08:57:

> Hi Congxian,
>
> Is there any conclusion on this issue yet?
>
> Best wishes.
>
> Peihui He <[hidden email]> wrote on Fri, Jul 17, 2020 at 16:21:
>
>> Hi Congxian,
>>
>> [image: Snipaste_2020-07-17_16-20-06.png]
>>
>> In the Chrome browser I can see that it was uploaded, and it can be downloaded.
>>
>> Best wishes.
>>
>> Congxian Qiu <[hidden email]> wrote on Fri, Jul 17, 2020 at 13:31:
>>
>>> Hi Peihui
>>>
>>> Thanks for your reply. I don't see the attachment on my side; could you double-check on yours?
>>>
>>> Best,
>>> Congxian
>>>
>>> Peihui He <[hidden email]> wrote on Fri, Jul 17, 2020 at 10:13:
>>>
>>>> Hi Congxian
>>>>
>>>> See the attachment.
>>>>
>>>> Best wishes.
>>>>
>>>> Congxian Qiu <[hidden email]> wrote on Thu, Jul 16, 2020 at 20:24:
>>>>
>>>>> Hi Peihui
>>>>>
>>>>> Thanks for your reply. Could you reproduce it once with 1.10.0 and share the relevant logs (the JM log and the TM log; if convenient, please also enable debug logging)? If the logs are too large, you can post them as a gist [1] and reply to the mailing list with the link. Many thanks~
>>>>>
>>>>> [1] https://gist.github.com/
>>>>>
>>>>> Best,
>>>>> Congxian
>>>>>
>>>>> Peihui He <[hidden email]> wrote on Thu, Jul 16, 2020 at 17:54:
>>>>>
>>>>>> Hi Yun,
>>>>>>
>>>>>> My test has to run on the cluster; running it locally in IDEA works fine. The flink-conf.yaml for Flink 1.10.1 was copied from 1.10.0, yet 1.10.0 is the one that fails.
>>>>>>
>>>>>> The attachment is the source of the job. If you want to run it you need to change the socket host. As soon as you type hepeihui into the socket, it throws the exception.
>>>>>>
>>>>>> Peihui He <[hidden email]> wrote on Thu, Jul 16, 2020 at 17:26:
>>>>>>
>>>>>>> Hi Yun,
>>>>>>>
>>>>>>> The job does not have local recovery enabled, and on my side 1.10.0 reproduces the problem every time.
>>>>>>>
>>>>>>> Best wishes.
>>>>>>>
>>>>>>> Yun Tang <[hidden email]> wrote on Thu, Jul 16, 2020 at 17:04:
>>>>>>>
>>>>>>>> Hi Peihui
>>>>>>>>
>>>>>>>> The only change in Flink 1.10.1 that touches the related code is the class used for the restore path [1], but your operating system is not Windows, so in principle that should be unrelated. Also, is the problem always reproducible when you hit a failover? Judging from the file paths, the job does not have local recovery enabled, right?
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/flink/commit/399329275e5e2baca9ed9494cce97ff732ac077a
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Yun Tang
>>>>>>>> ________________________________
>>>>>>>> From: Peihui He <[hidden email]>
>>>>>>>> Sent: Thursday, July 16, 2020 16:15
>>>>>>>> To: [hidden email] <[hidden email]>
>>>>>>>> Subject: Re: flink 1.9.2 upgrade to 1.10.0: failed job cannot recover from checkpoint
>>>>>>>>
>>>>>>>> Hi Yun,
>>>>>>>>
>>>>>>>> Sorry for taking so long to reply. It is the second case @Congxian described: I type a specific word into the socket, which throws a RuntimeException and fails the task; the job then tries to recover from the checkpoint, but during recovery it reports
>>>>>>>>
>>>>>>>> Caused by: java.nio.file.NoSuchFileException:
>>>>>>>> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/db/000009.sst
>>>>>>>> ->
>>>>>>>> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/8f609663-4fbb-483f-83c0-de04654310f7/000009.sst
>>>>>>>>
>>>>>>>> The situation is similar to @chenxyz's:
>>>>>>>> http://apache-flink.147419.n8.nabble.com/rocksdb-Could-not-restore-keyed-state-backend-for-KeyedProcessOperator-td2232.html
>>>>>>>>
>>>>>>>> Switching to 1.10.1 fixes it.
>>>>>>>>
>>>>>>>> Best wishes.
>>>>>>>>
>>>>>>>> Yun Tang <[hidden email]> wrote on Wed, Jul 15, 2020 at 16:35:
>>>>>>>>
>>>>>>>>> Hi Robin
>>>>>>>>>
>>>>>>>>> Actually your statement is not quite accurate. What the community explicitly guarantees is savepoint compatibility [1], but that does not mean recovery from a checkpoint across major versions is impossible. The community makes no promise mainly because maintaining it would cost too much effort; in practice, at the code level, as long as state schema evolution [2] is used properly, cross-version checkpoint recovery is currently basically compatible.
>>>>>>>>>
>>>>>>>>> Also, @Peihui, please describe your exception more clearly. My first reply already suggested that this exception is not the root cause; please look in the logs for the root cause of the failed recovery. If you don't know how to find it in the logs, you can share the relevant logs.
>>>>>>>>>
>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#compatibility-table
>>>>>>>>> [2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/schema_evolution.html
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Yun Tang
>>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Robin Zhang <[hidden email]>
>>>>>>>>> Sent: Wednesday, July 15, 2020 16:23
>>>>>>>>> To: [hidden email] <[hidden email]>
>>>>>>>>> Subject: Re: flink 1.9.2 upgrade to 1.10.0: failed job cannot recover from checkpoint
>>>>>>>>>
>>>>>>>>> As far as I know, you cannot recover directly from a checkpoint across major versions; you can only discard the state and rerun.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Robin Zhang
>>>>>>>>> ________________________________
>>>>>>>>> From: Peihui He <[hidden email]>
>>>>>>>>> Sent: Tuesday, July 14, 2020 10:42
>>>>>>>>> To: [hidden email] <[hidden email]>
>>>>>>>>> Subject: flink 1.9.2 upgrade to 1.10.0: failed job cannot recover from checkpoint
>>>>>>>>>
>>>>>>>>> hello,
>>>>>>>>>
>>>>>>>>> After upgrading to 1.10.0, when the program fails it tries to recover from the checkpoint, but it always fails with
>>>>>>>>>
>>>>>>>>> Caused by: java.nio.file.NoSuchFileException:
>>>>>>>>> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/db/000009.sst
>>>>>>>>> ->
>>>>>>>>> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/8f609663-4fbb-483f-83c0-de04654310f7/000009.sst
>>>>>>>>>
>>>>>>>>> The configuration is the same as in 1.9.2:
>>>>>>>>>
>>>>>>>>> state.backend: rocksdb
>>>>>>>>> state.checkpoints.dir: hdfs:///flink/checkpoints/wc/
>>>>>>>>> state.savepoints.dir: hdfs:///flink/savepoints/wc/
>>>>>>>>> state.backend.incremental: true
>>>>>>>>>
>>>>>>>>> and the code includes
>>>>>>>>>
>>>>>>>>> env.enableCheckpointing(10000);
>>>>>>>>> env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
>>>>>>>>> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS)));
>>>>>>>>>
>>>>>>>>> Does 1.10.0 need any special configuration?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sent from: http://apache-flink.147419.n8.nabble.com/
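Since the attached job never made it to the list, here is a minimal sketch of the kind of job Peihui describes, assuming the poison word "hepeihui" from his reply and a placeholder socket host/port; the checkpoint settings are the ones quoted in the original post (state.backend etc. live in flink-conf.yaml as shown there):

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Sketch only: reproduces "task fails on a poison word, job restores
    // from the latest checkpoint". Host, port, and class name are placeholders.
    public class CheckpointRecoveryRepro {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint settings as quoted in the original post.
            env.enableCheckpointing(10000);
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    3, org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS)));

            env.socketTextStream("localhost", 9999)  // change host/port for the cluster
                    .map(word -> {
                        // Typing the poison word into the socket fails the task,
                        // which triggers a restore from the latest checkpoint.
                        if ("hepeihui".equals(word)) {
                            throw new RuntimeException("poison word received: " + word);
                        }
                        return Tuple2.of(word, 1);
                    })
                    .returns(Types.TUPLE(Types.STRING, Types.INT))
                    .keyBy(0)
                    .sum(1)   // keyed aggregation, i.e. the StreamGroupedReduce in the stack trace
                    .print();

            env.execute("checkpoint-recovery-repro");
        }
    }

With this running against e.g. nc -lk 9999, typing hepeihui into the socket fails the task, and on 1.10.0 the subsequent RocksDB restore is where the NoSuchFileException above surfaced.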
In reply to this post by Yun Tang
Hi Yun Tang, sorry, I misunderstood this earlier. Thanks for the correction.

Best,
Robin Zhang
____________________________________________________________________

Yun Tang wrote
> Hi Robin
>
> Actually your statement is not quite accurate. What the community explicitly guarantees is savepoint compatibility [1], but that does not mean recovery from a checkpoint across major versions is impossible. The community makes no promise mainly because maintaining it would cost too much effort; in practice, at the code level, as long as state schema evolution [2] is used properly, cross-version checkpoint recovery is currently basically compatible.
>
> Also, @Peihui, please describe your exception more clearly. My first reply already suggested that this exception is not the root cause; please look in the logs for the root cause of the failed recovery. If you don't know how to find it in the logs, you can share the relevant logs.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#compatibility-table
> [2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/schema_evolution.html
>
> Best
> Yun Tang
> ________________________________
> From: Robin Zhang <vincent2015qdlg@>
> Sent: Wednesday, July 15, 2020 16:23
> To: user-zh@.apache <user-zh@.apache>
> Subject: Re: flink 1.9.2 upgrade to 1.10.0: failed job cannot recover from checkpoint
>
> As far as I know, you cannot recover directly from a checkpoint across major versions; you can only discard the state and rerun.
>
> Best
> Robin Zhang
> ________________________________
> From: Peihui He <[hidden email]>
> Sent: Tuesday, July 14, 2020 10:42
> To: [hidden email] <[hidden email]>
> Subject: flink 1.9.2 upgrade to 1.10.0: failed job cannot recover from checkpoint
>
> hello,
>
> After upgrading to 1.10.0, when the program fails it tries to recover from the checkpoint, but it always fails with
>
> Caused by: java.nio.file.NoSuchFileException:
> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/db/000009.sst
> ->
> /data/hadoop/yarn/local/usercache/hdfs/appcache/application_1589438582606_30760/flink-io-26af2be2-2b14-4eab-90d8-9ebb32ace6e3/job_6b6cacb02824b8521808381113f57eff_op_StreamGroupedReduce_54cc3719665e6629c9000e9308537a5e__1_1__uuid_afda2b8b-0b79-449e-88b5-c34c27c1a079/8f609663-4fbb-483f-83c0-de04654310f7/000009.sst
>
> The configuration is the same as in 1.9.2:
>
> state.backend: rocksdb
> state.checkpoints.dir: hdfs:///flink/checkpoints/wc/
> state.savepoints.dir: hdfs:///flink/savepoints/wc/
> state.backend.incremental: true
>
> and the code includes
>
> env.enableCheckpointing(10000);
> env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS)));
>
> Is there any special configuration 1.10.0 needs?
>
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/

--
Sent from: http://apache-flink.147419.n8.nabble.com/
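For reference, the state schema evolution Yun Tang points to in [2] applies when the state type is a POJO or Avro type. A minimal sketch of the POJO case (hypothetical WordStats/StatsFn names, not from the thread):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class StatsFn extends RichFlatMapFunction<String, StatsFn.WordStats> {
        private transient ValueState<WordStats> stats;

        @Override
        public void open(Configuration parameters) {
            stats = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("word-stats", WordStats.class));
        }

        @Override
        public void flatMap(String word, Collector<WordStats> out) throws Exception {
            WordStats s = stats.value();
            if (s == null) s = new WordStats();
            s.count++;
            stats.update(s);
            out.collect(s);
        }

        // POJO state type: public no-arg constructor and public fields, so Flink's
        // PojoSerializer applies, and fields may be added or removed between job
        // versions (though existing fields must keep their declared types).
        public static class WordStats {
            public long count;
            public WordStats() {}
        }
    }

Keeping state types within these POJO (or Avro) rules is what makes a restore after a version upgrade tolerate added or removed fields, which is the precondition Yun Tang's "in reasonable use of state schema evolution" remark refers to.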