Native K8s Flink: job recovery fails after a leader change

1120344670
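Hello,
   In our production Flink cluster, a leader pod failed to take over leadership and load the checkpoint information after a leader change. We run two JobManager pods using Kubernetes native high availability.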

pod1 log (this is also the leader pod that was recorded in the ConfigMap at the time; IP: 10.20.0.39):

2021-04-15 20:42:26,058 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- New leader elected 7d4a9b5c-39aa-4103-963b-eaf24ea6435a for
tuiwen-flink-restserver-leader.
2021-04-15 20:42:26,069 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
RPC endpoint for
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at
akka://flink/user/rpc/resourcemanager_0 .
2021-04-15 20:42:26,069 INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint   [] -
http://10.20.0.39:8081 was granted leadership with
leaderSessionID=a314d756-aa7c-4be4-a2a0-14267465d648
2021-04-15 20:42:26,261 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- Create KubernetesLeaderElector tuiwen-flink-dispatcher-leader with lock
identity 7d4a9b5c-39aa-4103-963b-eaf24ea6435a.
2021-04-15 20:42:26,660 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for
tuiwen-flink-dispatcher-leader.
2021-04-15 20:42:26,765 INFO
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
Starting DefaultLeaderElectionService with
KubernetesLeaderElectionDriver{configMapName='tuiwen-flink-dispatcher-leader'}.
2021-04-15 20:42:26,960 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] -
Starting DefaultLeaderRetrievalService with
KubernetesLeaderRetrievalDriver{configMapName='tuiwen-flink-resourcemanager-leader'}.
2021-04-15 20:42:27,258 INFO
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] -
Starting DefaultLeaderRetrievalService with
KubernetesLeaderRetrievalDriver{configMapName='tuiwen-flink-dispatcher-leader'}.
2021-04-15 20:42:30,457 INFO
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Recovered
2 pods from previous attempts, current attempt id is 2.
2021-04-15 20:42:30,458 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Recovered 2 workers from previous attempt.
2021-04-15 20:42:30,458 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Worker tuiwen-flink-taskmanager-1-12 recovered from previous attempt.
2021-04-15 20:42:30,458 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Worker tuiwen-flink-taskmanager-1-2 recovered from previous attempt.
2021-04-15 20:42:30,458 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- Create KubernetesLeaderElector tuiwen-flink-resourcemanager-leader with
lock identity 7d4a9b5c-39aa-4103-963b-eaf24ea6435a.
2021-04-15 20:42:30,959 INFO
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
Starting DefaultLeaderElectionService with
KubernetesLeaderElectionDriver{configMapName='tuiwen-flink-resourcemanager-leader'}.
2021-04-15 20:42:30,978 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for
tuiwen-flink-resourcemanager-leader.
2021-04-15 23:11:15,866 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:30,626 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:32,438 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:33,325 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:35,948 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:39,387 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:40,336 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:41,485 WARN
org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.

pod2 log:
2021-04-15 20:18:46,969 INFO
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating a
new watch on TaskManager pods.
2021-04-15 20:20:35,979 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
tuiwen-flink-6440d2d65c10d06131376b0420e8adf8-jobmanager-leader.
2021-04-15 20:24:06,209 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
tuiwen-flink-93cf6d866a84cb05815a0f852b3297f1-jobmanager-leader.
2021-04-15 20:31:24,430 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
tuiwen-flink-0d9a578a5689eca5d431814c337afd0e-jobmanager-leader.
2021-04-15 20:33:09,938 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
tuiwen-flink-3f809aa36d8500c185498673430ac0cd-jobmanager-leader.
2021-04-15 20:38:48,594 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
tuiwen-flink-58b7be950819b50c0de022a4b3bcffba-jobmanager-leader.
2021-04-15 20:41:48,424 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Disconnect job manager
afac3ea69e3abb20d0ddd6be2504479c@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_3
for job d5799c4dfc163b612be8106a44c987f1 from the resource manager.
2021-04-15 20:41:49,231 WARN  akka.remote.ReliableDeliverySupervisor                      
[] - Association with remote system
[akka.tcp://flink-metrics@10.20.0.39:45555] has failed, address is now gated
for [50] ms. Reason: [Disassociated]
2021-04-15 20:41:49,259 WARN  akka.remote.ReliableDeliverySupervisor                      
[] - Association with remote system [akka.tcp://flink@10.20.0.39:6123] has
failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-04-15 20:41:49,994 WARN  akka.remote.transport.netty.NettyTransport                  
[] - Remote connection to [null] failed with java.net.ConnectException:
Connection refused: /10.20.0.39:6123
2021-04-15 20:41:49,995 WARN  akka.remote.ReliableDeliverySupervisor                      
[] - Association with remote system [akka.tcp://flink@10.20.0.39:6123] has
failed, address is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@10.20.0.39:6123]] Caused by: [java.net.ConnectException:
Connection refused: /10.20.0.39:6123]
2021-04-15 20:42:00,020 WARN  akka.remote.transport.netty.NettyTransport                  
[] - Remote connection to [null] failed with java.net.ConnectException:
Connection refused: /10.20.0.39:6123
2021-04-15 20:42:00,022 WARN  akka.remote.ReliableDeliverySupervisor                      
[] - Association with remote system [akka.tcp://flink@10.20.0.39:6123] has
failed, address is now gated for [50] ms. Reason: [Association failed with
[akka.tcp://flink@10.20.0.39:6123]] Caused by: [java.net.ConnectException:
Connection refused: /10.20.0.39:6123]
2021-04-15 20:42:05,699 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for
tuiwen-flink-dispatcher-leader.
2021-04-15 20:42:05,700 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess []
- Start SessionDispatcherLeaderProcess.
2021-04-15 20:42:05,700 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess []
- Recover all persisted job graphs.

....

2021-04-15 20:42:29,990 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
The heartbeat of JobManager with id ddebda7152b763ce20e87c6778030ceb timed
out.
2021-04-15 20:42:29,991 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Disconnect job manager
97716898a46df03bf2a7c794350b4721@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_4
for job 2009312fce0c39e75b68b1d1c32da004 from the resource manager.
2021-04-15 20:42:29,991 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
The heartbeat of JobManager with id 64ff7b6d6a6596a0b669a24629c91d9d timed
out.
2021-04-15 20:42:29,991 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Disconnect job manager
86446db021cfd58474b55f7b99f94896@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_5
for job c2cf28b034ba867b9e7d546592217f75 from the resource manager.
2021-04-15 20:42:29,991 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
The heartbeat of JobManager with id 3b2e6514f91dce5e2bd6deabdfc0dd8d timed
out.
2021-04-15 20:42:29,991 INFO
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
Disconnect job manager
b368a2f35aba6a003a7a06dda7f94855@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_2
for job 7b00b2ad7f28b2695f948e0c36c0fedf from the resource manager.
2021-04-15 20:43:09,231 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
tuiwen-flink-e9f42bf5fd2b1ae62510fc2dcf52370d-jobmanager-leader.
2021-04-15 20:43:10,251 WARN
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Exec
Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within
30000ms (after 0 successful ping/pongs)
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
[flink-dist_2.12-1.12.2.jar:1.12.2]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
[flink-dist_2.12-1.12.2.jar:1.12.2]
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_282]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
[?:1.8.0_282]
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_282]
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
[?:1.8.0_282]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_282]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_282]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-04-15 20:43:11,254 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver
[] - Creating a new watch on ConfigMap tuiwen-flink-dispatcher-leader.
2021-04-15 20:43:46,415 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver
[] - Creating a new watch on ConfigMap
tuiwen-flink-f3118def5ccdc03e95960009069effc0-jobmanager-leader.
2021-04-15 20:44:11,290 WARN
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Exec
Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within
30000ms (after 0 successful ping/pongs)
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
[flink-dist_2.12-1.12.2.jar:1.12.2]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
[flink-dist_2.12-1.12.2.jar:1.12.2]
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_282]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
[?:1.8.0_282]
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_282]
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
[?:1.8.0_282]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_282]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_282]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-04-15 20:44:12,294 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver
[] - Creating a new watch on ConfigMap tuiwen-flink-dispatcher-leader.
....

2021-04-15 20:57:25,490 INFO
org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver
[] - Creating a new watch on ConfigMap tuiwen-flink-dispatcher-leader.
2021-04-15 20:58:25,493 WARN
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Exec
Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within
30000ms (after 0 successful ping/pongs)
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546)
[flink-dist_2.12-1.12.2.jar:1.12.2]
        at
org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530)
[flink-dist_2.12-1.12.2.jar:1.12.2]
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_282]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
[?:1.8.0_282]
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_282]
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
[?:1.8.0_282]
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_282]
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_282]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]

Because the logs are very long, some intermediate entries have been omitted.
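
One way to line up the two pods' timelines is to pull the "New leader elected" events out of the hard-wrapped log text above. The following is only a minimal sketch: the helper names `split_entries` and `leader_events` and the regular expressions are assumptions made for illustration and are not part of Flink. The sample text is copied from the pod1 log.

```python
import re

# Hypothetical helpers for reading the wrapped excerpts; not a Flink API.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")
ELECTED = re.compile(r"New leader elected ([0-9a-f-]{36}) for ([\w-]+)\.")

SAMPLE = """2021-04-15 20:42:26,058 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- New leader elected 7d4a9b5c-39aa-4103-963b-eaf24ea6435a for
tuiwen-flink-restserver-leader.
2021-04-15 20:42:26,660 INFO
org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector []
- New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for
tuiwen-flink-dispatcher-leader.
"""


def split_entries(raw):
    """Undo the mail client's line wrapping and split at each timestamp."""
    flat = " ".join(raw.split())  # collapse newlines and runs of spaces
    starts = [m.start() for m in TIMESTAMP.finditer(flat)]
    bounds = starts + [len(flat)]
    return [flat[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def leader_events(raw):
    """Return (timestamp, ConfigMap name, lock identity) per election event."""
    events = []
    for entry in split_entries(raw):
        match = ELECTED.search(entry)
        if match:
            # The first 19 characters are the timestamp without milliseconds.
            events.append((entry[:19], match.group(2), match.group(1)))
    return events


if __name__ == "__main__":
    for timestamp, config_map, identity in leader_events(SAMPLE):
        print(timestamp, config_map, identity)
```

Run against the full pod1 log, this kind of summary makes it easier to see which lock identity holds each of the restserver, dispatcher, and resourcemanager leader ConfigMaps at a given time.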






--
Sent from: http://apache-flink.147419.n8.nabble.com/