Periodic jobmanager restarts on 1.12.1 in K8s HA session mode


Periodic jobmanager restarts on 1.12.1 in K8s HA session mode

macdoor
The jobmanager restarts roughly every few tens of minutes. Could anyone suggest how to track this down? The error thrown is identical every time, and after running for a while a large number of ConfigMaps also pile up. Below is one concrete occurrence of the error.
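For context, the native-Kubernetes HA session setup under discussion is driven by a handful of flink-conf.yaml options. A minimal sketch only: option names are taken from the Flink 1.12 documentation, the storageDir value is a placeholder, and the cluster-id is inferred from the ConfigMap names shown further down.

kubernetes.cluster-id: test-flink-etl
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: hdfs:///flink/recovery   # placeholder; any durable storage path for HA metadata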

The error:

2021-01-17 04:16:46,116 ERROR
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Fatal error occurred in ResourceManager.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error
while watching the ConfigMap
test-flink-etl-42557c3f6325ffc876958430859178cd-jobmanager-leader
        at
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_275]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_275]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-17 04:16:46,117 ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal
error occurred in the cluster entrypoint.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error
while watching the ConfigMap
test-flink-etl-42557c3f6325ffc876958430859178cd-jobmanager-leader
        at
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_275]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_275]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-17 04:16:46,164 INFO  org.apache.flink.runtime.blob.BlobServer                    
[] - Stopped BLOB server at 0.0.0.0:6124

After the jobmanager restarted, this ConfigMap is still present:
test-flink-etl-42557c3f6325ffc876958430859178cd-jobmanager-leader

[gum@docker-repos ~]$ kubectl -n gem-flink get cm
test-flink-etl-42557c3f6325ffc876958430859178cd-jobmanager-leader -o yaml
apiVersion: v1
data:
  address: akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_3
  sessionId: c0f99c65-af3c-4916-ae7c-c272e2987e31
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader:
'{"holderIdentity":"5fd98e66-8f6e-4871-b349-fd8760e9eb6b","leaseDuration":15.000000000,"acquireTime":"2021-01-17T03:43:12.444000Z","renewTime":"2021-01-17T03:51:52.460000Z","leaderTransitions":105}'
  creationTimestamp: "2021-01-17T03:43:12Z"
  labels:
    app: test-flink-etl
    configmap-type: high-availability
    type: flink-native-kubernetes
  name: test-flink-etl-42557c3f6325ffc876958430859178cd-jobmanager-leader
  namespace: gem-flink
  resourceVersion: "39527319"
  selfLink:
/api/v1/namespaces/gem-flink/configmaps/test-flink-etl-42557c3f6325ffc876958430859178cd-jobmanager-leader
  uid: 70b979b5-b696-47b7-8eb8-558e8887f2c9
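On the accumulating ConfigMaps: the HA ConfigMaps carry the labels shown above, so leftovers from clusters that have been fully shut down can be listed and removed with a label selector. A rough sketch only; do not delete the leader ConfigMaps of a cluster that is still meant to be running, since they hold its HA metadata.

# list HA ConfigMaps created for this cluster-id
kubectl -n gem-flink get cm -l app=test-flink-etl,configmap-type=high-availability

# remove them once the cluster has been stopped for good
kubectl -n gem-flink delete cm -l app=test-flink-etl,configmap-type=high-availability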




Re: Periodic jobmanager restarts on 1.12.1 in K8s HA session mode

Yang Wang
Search the logs to see whether there are any "too old resource version" errors.
Also, check the network between the Pod and the APIServer and see whether the connection drops frequently.

Best,
Yang
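For example, the check suggested above can be run with a simple grep. The log path and pod name below are placeholders; /opt/flink/log is only an assumption based on the standard Flink image layout.

# grep the live JobManager pod log
kubectl -n gem-flink logs <jobmanager-pod> | grep -n "too old resource version"

# or grep log files kept on disk inside the container
grep -rn "too old resource version" /opt/flink/log/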

Re: Periodic jobmanager restarts on 1.12.1 in K8s HA session mode

macdoor
I went through the earlier logs and did not find "too old resource version". Several consecutive logs show no other errors either: the log goes straight to this error, the jobmanager restarts, and a new log begins.

The K8s cluster I am using does indeed seem to have a somewhat unstable network. How would you test the network between the Pod and the APIServer in a way that clearly demonstrates the problem? ping? Or some other tool?




Re: Periodic jobmanager restarts on 1.12.1 in K8s HA session mode

Yang Wang
You can use iperf to test the network; you need to install it into the image ahead of time.

Also, you can turn on DEBUG logging to see whether the watch only fails after many reconnect attempts that could not get through.

Best,
Yang
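Two rough ways to probe the network from inside the JobManager pod, as a sketch: the first assumes curl exists in the image and that the cluster exposes /healthz to anonymous requests (the default RBAC does); the second assumes iperf3 is installed in both images and that "iperf3 -s" is already running in the target pod, as a proxy for general pod network stability. Pod names and the server IP are placeholders.

# reachability and latency against the in-cluster API endpoint
kubectl -n gem-flink exec <jobmanager-pod> -- \
  curl -sk -o /dev/null -w "%{http_code} %{time_total}s\n" https://kubernetes.default.svc/healthz

# raw throughput / packet-loss test between two pods
kubectl -n gem-flink exec <jobmanager-pod> -- iperf3 -c <server-pod-ip> -t 30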


Re: Periodic jobmanager restarts on 1.12.1 in K8s HA session mode

macdoor
Thanks! I enabled DEBUG logging and there is still only that final ERROR, but before it there are quite a few log lines containing
kubernetes.client.dsl.internal.WatchConnectionManager. I grepped a portion of them; can anything be read from these?

job-debug-0118.log:2021-01-19 02:12:25,551 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:12:25,646 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@2553d42c
job-debug-0118.log:2021-01-19 02:12:25,647 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:12:30,128 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@5a9fa83e
job-debug-0118.log:2021-01-19 02:12:30,176 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:12:39,028 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Force
closing the watch
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@2553d42c
job-debug-0118.log:2021-01-19 02:12:39,028 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Closing websocket
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket@15b15029
job-debug-0118.log:2021-01-19 02:12:39,030 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket close received. code: 1000, reason:
job-debug-0118.log:2021-01-19 02:12:39,030 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Ignoring onClose for already closed/closing websocket
job-debug-0118.log:2021-01-19 02:12:39,031 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Force
closing the watch
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@2cdbe5a0
job-debug-0118.log:2021-01-19 02:12:39,031 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Closing websocket
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket@1e3f5396
job-debug-0118.log:2021-01-19 02:12:39,033 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket close received. code: 1000, reason:
job-debug-0118.log:2021-01-19 02:12:39,033 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Ignoring onClose for already closed/closing websocket
job-debug-0118.log:2021-01-19 02:12:42,677 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@210aab4b
job-debug-0118.log:2021-01-19 02:12:42,678 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:12:42,920 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@278d8398
job-debug-0118.log:2021-01-19 02:12:42,921 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:12:45,130 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@4b318628
job-debug-0118.log:2021-01-19 02:12:45,132 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:13:05,927 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Force
closing the watch
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@278d8398
job-debug-0118.log:2021-01-19 02:13:05,927 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Closing websocket
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket@69d1ebd2
job-debug-0118.log:2021-01-19 02:13:05,930 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket close received. code: 1000, reason:
job-debug-0118.log:2021-01-19 02:13:05,930 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Ignoring onClose for already closed/closing websocket
job-debug-0118.log:2021-01-19 02:13:05,940 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Force
closing the watch
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@210aab4b
job-debug-0118.log:2021-01-19 02:13:05,940 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Closing websocket
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket@3db9d8d8
job-debug-0118.log:2021-01-19 02:13:05,942 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket close received. code: 1000, reason:
job-debug-0118.log:2021-01-19 02:13:05,942 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Ignoring onClose for already closed/closing websocket
job-debug-0118.log:2021-01-19 02:13:08,378 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@4dcf905
job-debug-0118.log:2021-01-19 02:13:08,381 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:13:08,471 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@428ca061
job-debug-0118.log:2021-01-19 02:13:08,472 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:13:10,127 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@46b49e58
job-debug-0118.log:2021-01-19 02:13:10,128 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:13:21,625 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Force
closing the watch
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@428ca061
job-debug-0118.log:2021-01-19 02:13:21,625 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Closing websocket
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket@14e16427
job-debug-0118.log:2021-01-19 02:13:21,627 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket close received. code: 1000, reason:
job-debug-0118.log:2021-01-19 02:13:21,627 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Ignoring onClose for already closed/closing websocket
job-debug-0118.log:2021-01-19 02:13:21,628 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Force
closing the watch
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@4dcf905
job-debug-0118.log:2021-01-19 02:13:21,628 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Closing websocket
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket@11708e54
job-debug-0118.log:2021-01-19 02:13:21,630 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket close received. code: 1000, reason:
job-debug-0118.log:2021-01-19 02:13:21,630 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Ignoring onClose for already closed/closing websocket
job-debug-0118.log:2021-01-19 02:13:25,680 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@3ba4abd7
job-debug-0118.log:2021-01-19 02:13:25,681 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:13:25,908 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@23fe4bdd
job-debug-0118.log:2021-01-19 02:13:25,909 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:13:30,128 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@5cf8bd92
job-debug-0118.log:2021-01-19 02:13:30,175 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log:2021-01-19 02:13:46,104 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket close received. code: 1000, reason:
job-debug-0118.log:2021-01-19 02:13:46,105 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Submitting reconnect task to the executor
job-debug-0118.log:2021-01-19 02:13:46,113 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Scheduling reconnect task
job-debug-0118.log:2021-01-19 02:13:46,117 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Current reconnect backoff is 1000 milliseconds (T0)
job-debug-0118.log:2021-01-19 02:13:47,117 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@23f03575
job-debug-0118.log:2021-01-19 02:13:47,120 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
job-debug-0118.log: at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
[flink-dist_2.11-1.12.1.jar:1.12.1]
job-debug-0118.log: at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
[flink-dist_2.11-1.12.1.jar:1.12.1]
job-debug-0118.log: at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.12.1.jar:1.12.1]
job-debug-0118.log: at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
[flink-dist_2.11-1.12.1.jar:1.12.1]
job-debug-0118.log: at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
[flink-dist_2.11-1.12.1.jar:1.12.1]
job-debug-0118.log: at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.12.1.jar:1.12.1]


The final ERROR looks like this:

2021-01-19 02:13:47,094 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
event from subtask 406 of source Source:
HiveSource-snmpprobe.p_snmp_ifXTable: RequestSplitEvent (host='172.0.37.8')
2021-01-19 02:13:47,094 INFO
org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] -
Subtask 406 (on host '172.0.37.8') is requesting a file source split
2021-01-19 02:13:47,094 INFO
org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No
more splits available for subtask 406
2021-01-19 02:13:47,097 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (318/458)
(710557b37a1e03f0f462ab5303842489) switched from RUNNING to FINISHED.
2021-01-19 02:13:47,097 DEBUG
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Ignoring
transition of vertex Source: HiveSource-snmpprobe.p_snmp_ifXTable (318/458)
- execution #0 to FAILED while being FINISHED.
2021-01-19 02:13:47,097 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Remove logical slot (SlotRequestId{988c43b8a7b427ea962685f057438880})
for execution vertex (id 605b35e407e90cda15ad084365733fdd_317) from the
physical slot (SlotRequestId{37b03b71035c9d8c564bb7c299ee9b3d})
2021-01-19 02:13:47,097 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Release shared slot externally
(SlotRequestId{37b03b71035c9d8c564bb7c299ee9b3d})
2021-01-19 02:13:47,097 DEBUG
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Releasing
slot [SlotRequestId{37b03b71035c9d8c564bb7c299ee9b3d}] because: Slot is
being returned from SlotSharingExecutionSlotAllocator.
2021-01-19 02:13:47,097 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Release shared slot (SlotRequestId{37b03b71035c9d8c564bb7c299ee9b3d})
2021-01-19 02:13:47,097 DEBUG
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Fulfilling
pending slot request [SlotRequestId{bb5a48db898111288c811359cc2d7f51}] with
slot [385153f7c5efff54be584439258f7352]
2021-01-19 02:13:47,097 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Allocated logical slot
(SlotRequestId{78f370c05403ab3d703a8d89c19d23c8}) for execution vertex (id
605b35e407e90cda15ad084365733fdd_419) from the physical slot
(SlotRequestId{bb5a48db898111288c811359cc2d7f51})
2021-01-19 02:13:47,097 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (420/458)
(d04edd6e11b7cdc9e88c0ab6d756fed2) switched from SCHEDULED to DEPLOYING.
2021-01-19 02:13:47,097 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying
Source: HiveSource-snmpprobe.p_snmp_ifXTable (420/458) (attempt #0) with
attempt id d04edd6e11b7cdc9e88c0ab6d756fed2 to 172.0.42.250:6122-5d505f @
172-0-42-250.flink-taskmanager-query-state.gem-flink.svc.cluster.local
(dataPort=40697) with allocation id 385153f7c5efff54be584439258f7352
2021-01-19 02:13:47,097 DEBUG
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
Cancel slot request 4da96bc97ef9b47ba7e408c78835d75a.
2021-01-19 02:13:47,100 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (413/458)
(2124587d1641d6cb05c05dfc742e8423) switched from DEPLOYING to RUNNING.
2021-01-19 02:13:47,100 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (414/458)
(4aa88cc1b61dc5b056ad59d373392c2f) switched from DEPLOYING to RUNNING.
2021-01-19 02:13:47,100 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (415/458)
(225d9a604e6b852f2ea6e87ebcf3107c) switched from DEPLOYING to RUNNING.
2021-01-19 02:13:47,112 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (417/458)
(c1a78898e76b3f1761cd5be1913dd24c) switched from DEPLOYING to RUNNING.
2021-01-19 02:13:47,112 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (418/458)
(6deb899dbd8bf373b349b980d1e78506) switched from DEPLOYING to RUNNING.
2021-01-19 02:13:47,113 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (416/458)
(5a99fa4a1d8bdbf93345a9b20ae1fa91) switched from DEPLOYING to RUNNING.
2021-01-19 02:13:47,117 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (406/458)
(d1f9edd1bfdd80eef6b32b8850020130) switched from RUNNING to FINISHED.
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Ignoring
transition of vertex Source: HiveSource-snmpprobe.p_snmp_ifXTable (406/458)
- execution #0 to FAILED while being FINISHED.
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Remove logical slot (SlotRequestId{037efe676c5cec5fe6b549d3ebd5f72b})
for execution vertex (id 605b35e407e90cda15ad084365733fdd_405) from the
physical slot (SlotRequestId{a840e61c33cb3f250cfb54652c87aa64})
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Release shared slot externally
(SlotRequestId{a840e61c33cb3f250cfb54652c87aa64})
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Releasing
slot [SlotRequestId{a840e61c33cb3f250cfb54652c87aa64}] because: Slot is
being returned from SlotSharingExecutionSlotAllocator.
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Release shared slot (SlotRequestId{a840e61c33cb3f250cfb54652c87aa64})
2021-01-19 02:13:47,117 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
Connecting websocket ...
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@23f03575
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Fulfilling
pending slot request [SlotRequestId{538f35a507cd0949bf547588eb436b49}] with
slot [ebf6ebbb9abe3a9e6ccb56e235b00b53]
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Allocated logical slot
(SlotRequestId{a9bac8016853f9e86963b8ee11dea18f}) for execution vertex (id
605b35e407e90cda15ad084365733fdd_420) from the physical slot
(SlotRequestId{538f35a507cd0949bf547588eb436b49})
2021-01-19 02:13:47,117 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (421/458)
(706d012e7a572e1e5786536df9ab3bbb) switched from SCHEDULED to DEPLOYING.
2021-01-19 02:13:47,117 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying
Source: HiveSource-snmpprobe.p_snmp_ifXTable (421/458) (attempt #0) with
attempt id 706d012e7a572e1e5786536df9ab3bbb to 172.0.37.8:6122-694869 @
172-0-37-8.flink-taskmanager-query-state.gem-flink.svc.cluster.local
(dataPort=32959) with allocation id ebf6ebbb9abe3a9e6ccb56e235b00b53
2021-01-19 02:13:47,117 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (407/458)
(396cb3fdd115a31d8575407fa9ee6e07) switched from RUNNING to FINISHED.
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Ignoring
transition of vertex Source: HiveSource-snmpprobe.p_snmp_ifXTable (407/458)
- execution #0 to FAILED while being FINISHED.
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Remove logical slot (SlotRequestId{88e116a3dd0a40ef692734548aac9682})
for execution vertex (id 605b35e407e90cda15ad084365733fdd_406) from the
physical slot (SlotRequestId{9bb6a1762363d3996aded34c82abab54})
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Release shared slot externally
(SlotRequestId{9bb6a1762363d3996aded34c82abab54})
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Releasing
slot [SlotRequestId{9bb6a1762363d3996aded34c82abab54}] because: Slot is
being returned from SlotSharingExecutionSlotAllocator.
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Release shared slot (SlotRequestId{9bb6a1762363d3996aded34c82abab54})
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Fulfilling
pending slot request [SlotRequestId{88a7127cc4a86be0a962b9aa68d4feff}] with
slot [a30c9937af2c6de7ab471086cc9268f5]
2021-01-19 02:13:47,117 DEBUG org.apache.flink.runtime.scheduler.SharedSlot              
[] - Allocated logical slot
(SlotRequestId{da65200b5d50dcaaaff3f4373dd824c4}) for execution vertex (id
605b35e407e90cda15ad084365733fdd_421) from the physical slot
(SlotRequestId{88a7127cc4a86be0a962b9aa68d4feff})
2021-01-19 02:13:47,117 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (422/458)
(f12be47a0d11892e411d1afcb928b55a) switched from SCHEDULED to DEPLOYING.
2021-01-19 02:13:47,117 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying
Source: HiveSource-snmpprobe.p_snmp_ifXTable (422/458) (attempt #0) with
attempt id f12be47a0d11892e411d1afcb928b55a to 172.0.37.8:6122-694869 @
172-0-37-8.flink-taskmanager-query-state.gem-flink.svc.cluster.local
(dataPort=32959) with allocation id a30c9937af2c6de7ab471086cc9268f5
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
Cancel slot request 0a1cfa83b0d664615e0b9e1f938d7dee.
2021-01-19 02:13:47,117 DEBUG
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
Cancel slot request c1d63c4ffdf6e17212c4ca6be4071850.
2021-01-19 02:13:47,120 DEBUG
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
WebSocket successfully opened
2021-01-19 02:13:47,123 ERROR
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Fatal error occurred in ResourceManager.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error
while watching the ConfigMap
test-flink-etl-cb1c647ea7488765fd3e8cc1dc691e46-jobmanager-leader
        at
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_275]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_275]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-19 02:13:47,124 ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal
error occurred in the cluster entrypoint.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error
while watching the ConfigMap
test-flink-etl-cb1c647ea7488765fd3e8cc1dc691e46-jobmanager-leader
        at
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
[flink-dist_2.11-1.12.1.jar:1.12.1]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_275]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_275]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-19 02:13:47,125 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
event from subtask 365 of source Source:
HiveSource-snmpprobe.p_snmp_ifXTable: ReaderRegistrationEvent[subtaskId =
365, location = 172.0.37.16)
2021-01-19 02:13:47,125 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
event from subtask 365 of source Source:
HiveSource-snmpprobe.p_snmp_ifXTable: RequestSplitEvent (host='172.0.37.16')
2021-01-19 02:13:47,125 INFO
org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] -
Subtask 365 (on host '172.0.37.16') is requesting a file source split
2021-01-19 02:13:47,125 INFO
org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No
more splits available for subtask 365
2021-01-19 02:13:47,125 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
event from subtask 379 of source Source:
HiveSource-snmpprobe.p_snmp_ifXTable: ReaderRegistrationEvent[subtaskId =
379, location = 172.0.37.16)
2021-01-19 02:13:47,125 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
event from subtask 379 of source Source:
HiveSource-snmpprobe.p_snmp_ifXTable: RequestSplitEvent (host='172.0.37.16')
2021-01-19 02:13:47,125 INFO
org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] -
Subtask 379 (on host '172.0.37.16') is requesting a file source split
2021-01-19 02:13:47,125 INFO
org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No
more splits available for subtask 379
2021-01-19 02:13:47,130 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (389/458)
(b0d8b877b1911ffca609f818693b68ad) switched from DEPLOYING to RUNNING.
2021-01-19 02:13:47,131 INFO  org.apache.flink.runtime.blob.BlobServer                    
[] - Stopped BLOB server at 0.0.0.0:6124
2021-01-19 02:13:47,132 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
HiveSource-snmpprobe.p_snmp_ifXTable (421/458)
(706d012e7a572e1e5786536df9ab3bbb) switched from DEPLOYING to RUNNING.






Re: Periodic jobmanager restarts on 1.12.1 in K8s HA session mode

Yang Wang
It does look like there are a lot of "Connecting websocket" and "Scheduling reconnect task" log lines, so I still think the network between your Pod and the APIServer is not very stable.

Also, if possible, please send the complete DEBUG-level JobManager log.

Best,
Yang
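For reference, a DEBUG-level JobManager log can be produced by editing conf/log4j-console.properties in the image. A sketch of the relevant lines, assuming the stock Flink 1.12 log4j2 layout; the "fabric8" logger id is an arbitrary name chosen here.

# raise the root logger from INFO to DEBUG for a full debug-level JobManager log
rootLogger.level = DEBUG

# alternatively, keep the root at INFO and enable DEBUG only for the Kubernetes watch client
logger.fabric8.name = io.fabric8.kubernetes.client.dsl.internal
logger.fabric8.level = DEBUG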

> (SlotRequestId{a840e61c33cb3f250cfb54652c87aa64})
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Releasing
> slot [SlotRequestId{a840e61c33cb3f250cfb54652c87aa64}] because: Slot is
> being returned from SlotSharingExecutionSlotAllocator.
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.scheduler.SharedSlot
> [] - Release shared slot (SlotRequestId{a840e61c33cb3f250cfb54652c87aa64})
> 2021-01-19 02:13:47,117 DEBUG
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
> Connecting websocket ...
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@23f03575
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Fulfilling
> pending slot request [SlotRequestId{538f35a507cd0949bf547588eb436b49}] with
> slot [ebf6ebbb9abe3a9e6ccb56e235b00b53]
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.scheduler.SharedSlot
> [] - Allocated logical slot
> (SlotRequestId{a9bac8016853f9e86963b8ee11dea18f}) for execution vertex (id
> 605b35e407e90cda15ad084365733fdd_420) from the physical slot
> (SlotRequestId{538f35a507cd0949bf547588eb436b49})
> 2021-01-19 02:13:47,117 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> HiveSource-snmpprobe.p_snmp_ifXTable (421/458)
> (706d012e7a572e1e5786536df9ab3bbb) switched from SCHEDULED to DEPLOYING.
> 2021-01-19 02:13:47,117 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying
> Source: HiveSource-snmpprobe.p_snmp_ifXTable (421/458) (attempt #0) with
> attempt id 706d012e7a572e1e5786536df9ab3bbb to 172.0.37.8:6122-694869 @
> 172-0-37-8.flink-taskmanager-query-state.gem-flink.svc.cluster.local
> (dataPort=32959) with allocation id ebf6ebbb9abe3a9e6ccb56e235b00b53
> 2021-01-19 02:13:47,117 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> HiveSource-snmpprobe.p_snmp_ifXTable (407/458)
> (396cb3fdd115a31d8575407fa9ee6e07) switched from RUNNING to FINISHED.
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Ignoring
> transition of vertex Source: HiveSource-snmpprobe.p_snmp_ifXTable (407/458)
> - execution #0 to FAILED while being FINISHED.
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.scheduler.SharedSlot
> [] - Remove logical slot (SlotRequestId{88e116a3dd0a40ef692734548aac9682})
> for execution vertex (id 605b35e407e90cda15ad084365733fdd_406) from the
> physical slot (SlotRequestId{9bb6a1762363d3996aded34c82abab54})
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.scheduler.SharedSlot
> [] - Release shared slot externally
> (SlotRequestId{9bb6a1762363d3996aded34c82abab54})
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Releasing
> slot [SlotRequestId{9bb6a1762363d3996aded34c82abab54}] because: Slot is
> being returned from SlotSharingExecutionSlotAllocator.
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.scheduler.SharedSlot
> [] - Release shared slot (SlotRequestId{9bb6a1762363d3996aded34c82abab54})
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Fulfilling
> pending slot request [SlotRequestId{88a7127cc4a86be0a962b9aa68d4feff}] with
> slot [a30c9937af2c6de7ab471086cc9268f5]
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.scheduler.SharedSlot
> [] - Allocated logical slot
> (SlotRequestId{da65200b5d50dcaaaff3f4373dd824c4}) for execution vertex (id
> 605b35e407e90cda15ad084365733fdd_421) from the physical slot
> (SlotRequestId{88a7127cc4a86be0a962b9aa68d4feff})
> 2021-01-19 02:13:47,117 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> HiveSource-snmpprobe.p_snmp_ifXTable (422/458)
> (f12be47a0d11892e411d1afcb928b55a) switched from SCHEDULED to DEPLOYING.
> 2021-01-19 02:13:47,117 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying
> Source: HiveSource-snmpprobe.p_snmp_ifXTable (422/458) (attempt #0) with
> attempt id f12be47a0d11892e411d1afcb928b55a to 172.0.37.8:6122-694869 @
> 172-0-37-8.flink-taskmanager-query-state.gem-flink.svc.cluster.local
> (dataPort=32959) with allocation id a30c9937af2c6de7ab471086cc9268f5
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
> Cancel slot request 0a1cfa83b0d664615e0b9e1f938d7dee.
> 2021-01-19 02:13:47,117 DEBUG
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
> Cancel slot request c1d63c4ffdf6e17212c4ca6be4071850.
> 2021-01-19 02:13:47,120 DEBUG
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] -
> WebSocket successfully opened
> 2021-01-19 02:13:47,123 ERROR org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Fatal error occurred in ResourceManager.
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap test-flink-etl-cb1c647ea7488765fd3e8cc1dc691e46-jobmanager-leader
>         at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
> 2021-01-19 02:13:47,124 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap test-flink-etl-cb1c647ea7488765fd3e8cc1dc691e46-jobmanager-leader
>         at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.12.1.jar:1.12.1]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
> 2021-01-19 02:13:47,125 DEBUG
> org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
> event from subtask 365 of source Source:
> HiveSource-snmpprobe.p_snmp_ifXTable: ReaderRegistrationEvent[subtaskId =
> 365, location = 172.0.37.16)
> 2021-01-19 02:13:47,125 DEBUG
> org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
> event from subtask 365 of source Source:
> HiveSource-snmpprobe.p_snmp_ifXTable: RequestSplitEvent
> (host='172.0.37.16')
> 2021-01-19 02:13:47,125 INFO
> org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] -
> Subtask 365 (on host '172.0.37.16') is requesting a file source split
> 2021-01-19 02:13:47,125 INFO
> org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No
> more splits available for subtask 365
> 2021-01-19 02:13:47,125 DEBUG
> org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
> event from subtask 379 of source Source:
> HiveSource-snmpprobe.p_snmp_ifXTable: ReaderRegistrationEvent[subtaskId =
> 379, location = 172.0.37.16)
> 2021-01-19 02:13:47,125 DEBUG
> org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Handling
> event from subtask 379 of source Source:
> HiveSource-snmpprobe.p_snmp_ifXTable: RequestSplitEvent
> (host='172.0.37.16')
> 2021-01-19 02:13:47,125 INFO
> org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] -
> Subtask 379 (on host '172.0.37.16') is requesting a file source split
> 2021-01-19 02:13:47,125 INFO
> org.apache.flink.connector.file.src.impl.StaticFileSplitEnumerator [] - No
> more splits available for subtask 379
> 2021-01-19 02:13:47,130 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> HiveSource-snmpprobe.p_snmp_ifXTable (389/458)
> (b0d8b877b1911ffca609f818693b68ad) switched from DEPLOYING to RUNNING.
> 2021-01-19 02:13:47,131 INFO  org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
> 2021-01-19 02:13:47,132 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> HiveSource-snmpprobe.p_snmp_ifXTable (421/458)
> (706d012e7a572e1e5786536df9ab3bbb) switched from DEPLOYING to RUNNING.
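
In 1.12.1 the Kubernetes HA watcher treats a closed ConfigMap watch as fatal: as the stack trace shows, AbstractKubernetesWatcher.onClose(...) hands the exception to the leader-retrieval driver's handleFatalError(...), which fails the ResourceManager and the ClusterEntrypoint, so one dropped watch connection restarts the JobManager even though the websocket reconnect itself succeeds a few milliseconds later. Below is a minimal sketch of that callback shape, written against the fabric8 4.x Watcher API that flink-kubernetes 1.12 bundles; the class and the FatalErrorHandler interface are stand-ins for illustration, not Flink's actual code.

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;

// Simplified illustration of the callback chain seen in the stack trace:
// the fabric8 client calls onClose(...) when the watch websocket is closed
// abnormally, and the handler escalates that to a fatal error instead of
// re-establishing the watch.
public class ConfigMapWatcherSketch implements Watcher<ConfigMap> {

    /** Hypothetical stand-in for Flink's fatal error handler. */
    public interface FatalErrorHandler {
        void onFatalError(Throwable t);
    }

    private final FatalErrorHandler fatalErrorHandler;

    public ConfigMapWatcherSketch(FatalErrorHandler fatalErrorHandler) {
        this.fatalErrorHandler = fatalErrorHandler;
    }

    @Override
    public void eventReceived(Action action, ConfigMap configMap) {
        // On ADDED/MODIFIED events the leader address would be read from the
        // ConfigMap data and forwarded to the retrieval listener (omitted).
    }

    @Override
    public void onClose(KubernetesClientException cause) {
        if (cause != null) {
            // This mirrors the "Error while watching the ConfigMap" failure:
            // the close is reported as fatal, the ResourceManager and
            // ClusterEntrypoint go down, and Kubernetes restarts the pod.
            fatalErrorHandler.onFatalError(
                    new RuntimeException("Error while watching the ConfigMap", cause));
        }
    }
}

If I recall correctly, later Flink releases re-create the watch on this kind of close instead of failing fatally, so checking the 1.12.x patch release notes may be worthwhile in addition to reading the DEBUG log.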

Re: K8s HA Session模式下1.12.1 jobmanager 周期性 restart

macdoor
Sure, how should I send it to you?




Re: K8s HA Session模式下1.12.1 jobmanager 周期性 restart

Yang Wang
Send it as an attachment, or upload it to some third-party storage and share the link here.

macdoor <[hidden email]> wrote on Tue, Jan 19, 2021, at 12:44 PM:

> Sure, how should I send it to you?

Re: K8s HA Session模式下1.12.1 jobmanager 周期性 restart

macdoor

Re: K8s HA Session模式下1.12.1 jobmanager 周期性 restart

macdoor
Did you get it? Have you found anything?


