Hi all,
Following this blog post https://blog.csdn.net/cndotaci/article/details/106870413, I configured Flink high availability on YARN. While testing, the configured limit of 2 failure attempts did not take effect: even on my 6th test kill, the job was still relaunched by YARN.

Two questions:

1. Why does the attempt limit in the configuration below not take effect?
2. Can a job relaunched by HA reuse the state from the failed run?

Flink version: 1.10.0

flink-conf.yaml:

$ grep -v ^# flink-conf.yaml | grep -v ^$
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.memory.process.size: 1568m
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: uhadoop-op3raf-master1,uhadoop-op3raf-master2,uhadoop-op3raf-core1
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/flink-savepoints
state.checkpoints.num-retained: 60
state.backend.incremental: true
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: hdfs:///flink/flink-jobs/
historyserver.web.port: 8082
historyserver.archive.fs.dir: hdfs:///flink/flink-jobs/
historyserver.archive.fs.refresh-interval: 10000
# number of HA attempts
yarn.application-attempts: 2

Log of ssh-ing to the JobManager node and killing the process by hand:

[root@uhadoop-op3raf-task48 ~]# jps
34785 YarnTaskExecutorRunner
16853 YarnTaskExecutorRunner
17527 PrestoServer
33289 YarnTaskExecutorRunner
18026 YarnJobClusterEntrypoint
20283 Jps
39599 NodeManager
[root@uhadoop-op3raf-task48 ~]# kill -9 18026
[root@uhadoop-op3raf-task48 ~]# jps
34785 YarnTaskExecutorRunner
16853 -- process information unavailable
17527 PrestoServer
21383 Jps
33289 YarnTaskExecutorRunner
20412 YarnJobClusterEntrypoint
39599 NodeManager
[root@uhadoop-op3raf-task48 ~]# kill -9 20412
[root@uhadoop-op3raf-task48 ~]# jps
34785 YarnTaskExecutorRunner
21926 YarnJobClusterEntrypoint
23207 Jps
17527 PrestoServer
33289 YarnTaskExecutorRunner
39599 NodeManager
[root@uhadoop-op3raf-task48 ~]# kill -9 21926
[root@uhadoop-op3raf-task48 ~]# jps
34785 YarnTaskExecutorRunner
23318 YarnJobClusterEntrypoint
26279 Jps
17527 PrestoServer
33289 YarnTaskExecutorRunner
39599 NodeManager
[root@uhadoop-op3raf-task48 ~]# kill -9 23318
Hi MuChen,
1. yarn.application-attempts works together with another option, yarn.application-attempt-failures-validity-interval. Roughly: the job is only considered permanently failed if it accumulates the configured number of failed attempts within that validity interval; attempts that fall outside the interval no longer count, so the counter effectively starts over. For example, with yarn.application-attempts: 2 and yarn.application-attempt-failures-validity-interval = 10000 (the default, 10 s), the Flink job only fails for good when it fails twice within 10 s. Your manual kills were almost certainly more than 10 s apart (each relaunch alone takes longer than that), so the counter kept resetting, which would explain why YARN could still pull the job up on the 6th kill.

2. If checkpointing is enabled, the relaunched job will reuse the state from the last completed checkpoint of the failed run.

This is my personal understanding; happy to discuss if anyone sees it differently.
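For illustration, a sketch of the relevant flink-conf.yaml lines; the 3600000 value is only an example I made up, not a recommendation:

yarn.application-attempts: 2
# Attempt failures are only counted against the limit inside this window (ms).
# The default is 10000 (10 s); widen it if kills that are minutes apart
# should count toward the 2-attempt limit.
yarn.application-attempt-failures-validity-interval: 3600000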
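On point 2, here is a minimal DataStream sketch (the class name, the 60 s interval, and the job name are my own placeholders) of the checkpointing setup that lets a relaunched JobManager restore state:

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot state every 60 s; checkpoints go to state.checkpoints.dir
        // (hdfs:///flink/checkpoint in your flink-conf.yaml).
        env.enableCheckpointing(60_000L);

        // Retain the latest checkpoint even when the job is cancelled, so it
        // can also be used for a manual restart: flink run -s <checkpoint-path> ...
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... sources, transformations and sinks go here ...

        env.execute("checkpointed-job");
    }
}

With HA enabled, the recovery itself is automatic: the new JobManager finds the latest completed checkpoint through ZooKeeper and restores from it, so nothing beyond enabling checkpointing is needed in the job code.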
Hi 王松,
Good to know, thanks a lot for the explanation!

Best,
MuChen