A Flink 1.11.1 job runs in Application Mode on a K8s cluster, and the JobManager restarts about once an hour with the error "Fatal error occurred in ResourceManager. io.fabric8.kubernetes.client.KubernetesClientException: too old resource version".

Pod restarts:
<http://apache-flink.147419.n8.nabble.com/file/t1176/11.jpg>

Restart reason:

2020-12-10 07:21:19,290 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
2020-12-10 07:21:19,291 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]

The explanation I found online points to line 212 of the class org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient:

    @Override
    public KubernetesWatch watchPodsAndDoCallback(Map<String, String> labels, PodCallbackHandler podCallbackHandler) {
        return new KubernetesWatch(
            this.internalClient.pods()
                .withLabels(labels)
                .watch(new KubernetesPodsWatcher(podCallbackHandler)));
    }

whereas etcd only retains version information for a limited time:

"I think it's standard behavior of Kubernetes to give 410 after some time during watch. It's usually client's responsibility to handle it. In the context of a watch, it will return HTTP_GONE when you ask to see changes for a resourceVersion that is too old - i.e. when it can no longer tell you what has changed since that version, since too many things have changed. In that case, you'll need to start again, by not specifying a resourceVersion in which case the watch will send you the current state of the thing you are watching and then send updates from that point."

Has anyone run into the same problem, and how did you handle it? I have a few possible approaches and would like to discuss them with everyone.
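The advice quoted above (after HTTP 410, start the watch again without a resourceVersion) could be handled on the client side roughly as in the sketch below. It is a self-contained illustration only: `WatchSource`, `WatchClosedException`, `watchWithRestart`, and the in-process fake server are hypothetical names invented for this example, not Flink or fabric8 APIs, and the restart limit of 3 is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch only: every type here is hypothetical, not Flink's or fabric8's API. */
final class RewatchSketch {

    static final int HTTP_GONE = 410;

    /** Hypothetical stand-in for the error a watch reports when it is closed. */
    static final class WatchClosedException extends RuntimeException {
        final int httpCode;

        WatchClosedException(int httpCode, String message) {
            super(message);
            this.httpCode = httpCode;
        }
    }

    /** Hypothetical watch source: streams events to the sink until the watch ends. */
    interface WatchSource {
        // resourceVersion == null means "start from the current state".
        void watch(String resourceVersion, Consumer<String> sink);
    }

    /** Runs the watch and re-creates it without a resourceVersion after HTTP 410. */
    static void watchWithRestart(
            WatchSource source, String startVersion, Consumer<String> sink, int maxRestarts) {
        String version = startVersion;
        int restarts = 0;
        while (true) {
            try {
                source.watch(version, sink);
                return; // the watch ended normally
            } catch (WatchClosedException e) {
                if (e.httpCode != HTTP_GONE || restarts >= maxRestarts) {
                    throw e; // not handled here: let the caller decide
                }
                restarts++;
                version = null; // start again from the current state
                sink.accept("restart after: " + e.getMessage());
            }
        }
    }

    /** Drives the loop against a fake server and returns what the sink received. */
    static List<String> runDemo() {
        List<String> log = new ArrayList<>();
        // Fake API server: rejects any explicit version as stale, serves fresh watches.
        WatchSource fakeServer = (version, sink) -> {
            if (version != null) {
                throw new WatchClosedException(
                        HTTP_GONE, "too old resource version: " + version);
            }
            sink.accept("current state");
            sink.accept("update");
        };
        watchWithRestart(fakeServer, "247468999", log::add, 3);
        return log;
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

The loop treats only HTTP 410 as recoverable and rethrows everything else, so other watch failures would still surface to the caller rather than being retried silently.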
I replied to this earlier in another email; I am copying that reply here.
I have created a JIRA ticket to track the "too old resource version" issue [1].

Flink uses a Watcher to monitor Pod status changes. When the Watcher is closed abnormally, a fatal error is triggered, which in turn causes the JobManager to restart.

I ran some concrete tests on minikube, a self-built K8s cluster, and an Alibaba Cloud ACK cluster; in all of them the job ran stably for more than a week without problems. I reproduced the issue by restarting the K8s APIServer. So I suspect that the network between your Pod and the APIServer is unstable, which would make this issue show up frequently.

[1]. https://issues.apache.org/jira/browse/FLINK-20417

Best,
Yang

In reply to the message from lichunguang <[hidden email]>, Mon, Dec 21, 2020, 3:51 PM.