Flink1.11.1版本Application Mode job on K8S集群,too old resource version问题

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink1.11.1版本Application Mode job on K8S集群,too old resource version问题

lichunguang
Flink1.11.1版本job以Application Mode在K8S集群上运行,jobmanager每个小时会重启一次,报错【Fatal error
occurred in
ResourceManager.io.fabric8.kubernetes.client.KubernetesClientException: too
old resource version】

pod重启:
<http://apache-flink.147419.n8.nabble.com/file/t1176/11.jpg>

重启原因:
2020-12-10 07:21:19,290 ERROR
org.apache.flink.kubernetes.KubernetesResourceManager        [] - Fatal
error occurred in ResourceManager.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource
version: 247468999 (248117930)
  at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_202]
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_202]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
2020-12-10 07:21:19,291 ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal
error occurred in the cluster entrypoint.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource
version: 247468999 (248117930)
  at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
[flink-dist_2.11-1.11.1.jar:1.11.1]
  at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_202]
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_202]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]


网上查的原因是因为:
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient类中212行

@Override
public KubernetesWatch watchPodsAndDoCallback(Map<String, String> labels,
PodCallbackHandler podCallbackHandler) {
                return new KubernetesWatch(
                        this.internalClient.pods()
                                .withLabels(labels)
                                .watch(new KubernetesPodsWatcher(podCallbackHandler)));
}

而ETCD中只会保留一段时间的version信息
【 I think it's standard behavior of Kubernetes to give 410 after some time
during watch. It's usually client's responsibility to handle it. In the
context of a watch, it will return HTTP_GONE when you ask to see changes for
a resourceVersion that is too old - i.e. when it can no longer tell you what
has changed since that version, since too many things have changed. In that
case, you'll need to start again, by not specifying a resourceVersion in
which case the watch will send you the current state of the thing you are
watching and then send updates from that point.】

大家有没遇到相同的问题,是怎么处理的?我有几个处理方式,希望能跟大家一起讨论一下。




--
Sent from: http://apache-flink.147419.n8.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Flink1.11.1版本Application Mode job on K8S集群,too old resource version问题

Yang Wang
我之前在另一个邮件里面回复过,我再拷贝过来。

目前我已经建了一个JIRA来跟进too old resource version的问题[1]

在Flink里面采用了Watcher来监控Pod的状态变化,当Watcher被异常close的时候就会触发fatal
error进而导致JobManager的重启

我这边做过一些具体的测试,在minikube、自建的K8s集群、阿里云ACK集群,稳定运行一周以上都是正常的。这个问题复现是通过重启
K8s的APIServer来做到的。所以我怀疑你那边Pod和APIServer之间的网络是不是不稳定,从而导致这个问题经常出现。


[1]. https://issues.apache.org/jira/browse/FLINK-20417

Best,
Yang

lichunguang <[hidden email]> 于2020年12月21日周一 下午3:51写道:

> Flink1.11.1版本job以Application Mode在K8S集群上运行,jobmanager每个小时会重启一次,报错【Fatal
> error
> occurred in
> ResourceManager.io.fabric8.kubernetes.client.KubernetesClientException: too
> old resource version】
>
> pod重启:
> <http://apache-flink.147419.n8.nabble.com/file/t1176/11.jpg>
>
> 重启原因:
> 2020-12-10 07:21:19,290 ERROR
> org.apache.flink.kubernetes.KubernetesResourceManager        [] - Fatal
> error occurred in ResourceManager.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource
> version: 247468999 (248117930)
>   at
>
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .RealWebSocket.onReadMessage(RealWebSocket.java:323)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .WebSocketReader.readMessageFrame(WebSocketReader.java:219)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .WebSocketReader.processNextFrame(WebSocketReader.java:105)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .RealWebSocket.loopReader(RealWebSocket.java:274)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .RealWebSocket$2.onResponse(RealWebSocket.java:214)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
>
> org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
>
> org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_202]
>   at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_202]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
> 2020-12-10 07:21:19,291 ERROR
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal
> error occurred in the cluster entrypoint.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource
> version: 247468999 (248117930)
>   at
>
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .RealWebSocket.onReadMessage(RealWebSocket.java:323)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .WebSocketReader.readMessageFrame(WebSocketReader.java:219)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .WebSocketReader.processNextFrame(WebSocketReader.java:105)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .RealWebSocket.loopReader(RealWebSocket.java:274)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.ws
> .RealWebSocket$2.onResponse(RealWebSocket.java:214)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
>
> org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
>
> org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>   at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_202]
>   at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_202]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
>
>
> 网上查的原因是因为:
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient类中212行
>
> @Override
> public KubernetesWatch watchPodsAndDoCallback(Map<String, String> labels,
> PodCallbackHandler podCallbackHandler) {
>                 return new KubernetesWatch(
>                         this.internalClient.pods()
>                                 .withLabels(labels)
>                                 .watch(new
> KubernetesPodsWatcher(podCallbackHandler)));
> }
>
> 而ETCD中只会保留一段时间的version信息
> 【 I think it's standard behavior of Kubernetes to give 410 after some time
> during watch. It's usually client's responsibility to handle it. In the
> context of a watch, it will return HTTP_GONE when you ask to see changes
> for
> a resourceVersion that is too old - i.e. when it can no longer tell you
> what
> has changed since that version, since too many things have changed. In that
> case, you'll need to start again, by not specifying a resourceVersion in
> which case the watch will send you the current state of the thing you are
> watching and then send updates from that point.】
>
> 大家有没遇到相同的问题,是怎么处理的?我有几个处理方式,希望能跟大家一起讨论一下。
>
>
>
>
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
>