Flink job fails after running for a while because checkpoints time out

Flink job fails after running for a while because checkpoints time out

jordan95225
Hi,
I have a Flink job whose checkpoints start timing out after it has been running for a while. The INFO messages are as follows:
checkpoint xxx of job xxx expired before completing.
Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
failure threshold.
I then checked the TaskManager log and found a WARN entry just before the error:
WARN  akka.remote.Remoting                                         [] -
Association to [akka.tcp://flink@hadoop43:38839] with unknown UID is
irrecoverably failed. Address cannot be quarantined without knowing the UID,
gating instead for 50 ms.
Right after this WARN, the task logs "Attempting to cancel task Source". I don't know what is causing this and would appreciate any replies.
Best



--
Sent from: http://apache-flink.147419.n8.nabble.com/
Re: Flink job fails after running for a while because checkpoints time out

Congxian Qiu
Hi
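   1 Which Flink version is your job running on?
   2 Your job most likely failed because the tolerable failure threshold was exceeded. You can
set this threshold in the checkpoint config so that a failed checkpoint does not fail the job.
   3 If possible, please upload the JM and TM logs.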
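
To illustrate point 2, here is a minimal sketch of how a tolerable failure threshold behaves. It is a simplified model and not Flink's actual code: the class `CheckpointFailureCounter` and the method `firstFatalIndex` are hypothetical names, and the model assumes that failures are counted consecutively and that a completed checkpoint resets the count. In a real job the threshold is set through `CheckpointConfig#setTolerableCheckpointFailureNumber(int)`; please check the documentation for your Flink version for the default value and for which failure types are counted.

```java
/**
 * Simplified model of a tolerable-checkpoint-failure threshold.
 * Hypothetical illustration only, not Flink's implementation.
 *
 * In a real job the threshold would be configured roughly like this:
 *   env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
 */
final class CheckpointFailureCounter {
    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    CheckpointFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a failed (for example, expired) checkpoint; returns true if the job should fail. */
    boolean onFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    /** Assumption of this model: a completed checkpoint resets the failure streak. */
    void onSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Replays checkpoint outcomes (true = completed, false = failed) and returns
     * the index of the checkpoint that pushes the job over the threshold, or -1.
     */
    static int firstFatalIndex(int tolerableFailures, boolean... outcomes) {
        CheckpointFailureCounter counter = new CheckpointFailureCounter(tolerableFailures);
        for (int i = 0; i < outcomes.length; i++) {
            if (outcomes[i]) {
                counter.onSuccess();
            } else if (counter.onFailure()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] outcomes = {true, false, true, false, false, false};
        // Threshold 0: the first failed checkpoint (index 1) already fails the job.
        System.out.println(firstFatalIndex(0, outcomes)); // prints 1
        // Threshold 2: only the third consecutive failure (index 5) fails the job.
        System.out.println(firstFatalIndex(2, outcomes)); // prints 5
    }
}
```

Raising the threshold only keeps the job alive through occasional expired checkpoints. It does not explain why the checkpoints time out, which is why the JM and TM logs are still needed.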
Best,
Congxian


jordan95225 <[hidden email]> wrote on Monday, September 7, 2020 at 11:05 AM:

> Hi,
> I have a Flink job whose checkpoints start timing out after it has been running for a while. The INFO messages are as follows:
> checkpoint xxx of job xxx expired before completing.
> Trying to recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable
> failure threshold.
> I then checked the TaskManager log and found a WARN entry just before the error:
> WARN  akka.remote.Remoting                                         [] -
> Association to [akka.tcp://flink@hadoop43:38839] with unknown UID is
> irrecoverably failed. Address cannot be quarantined without knowing the UID,
> gating instead for 50 ms.
> Right after this WARN, the task logs "Attempting to cancel task Source". I don't know what is causing this and would appreciate any replies.
> Best