Hello,
In our production Flink cluster, a leader pod failed to update the leader and load the checkpoint information. We run two pods with Kubernetes native high availability.

pod1 logs (this is also the leader pod that was recorded in the ConfigMap at the time, IP: 10.20.0.39):

2021-04-15 20:42:26,058 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 7d4a9b5c-39aa-4103-963b-eaf24ea6435a for tuiwen-flink-restserver-leader.
2021-04-15 20:42:26,069 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at akka://flink/user/rpc/resourcemanager_0 .
2021-04-15 20:42:26,069 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint [] - http://10.20.0.39:8081 was granted leadership with leaderSessionID=a314d756-aa7c-4be4-a2a0-14267465d648
2021-04-15 20:42:26,261 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector tuiwen-flink-dispatcher-leader with lock identity 7d4a9b5c-39aa-4103-963b-eaf24ea6435a.
2021-04-15 20:42:26,660 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for tuiwen-flink-dispatcher-leader.
2021-04-15 20:42:26,765 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='tuiwen-flink-dispatcher-leader'}.
2021-04-15 20:42:26,960 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='tuiwen-flink-resourcemanager-leader'}.
2021-04-15 20:42:27,258 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='tuiwen-flink-dispatcher-leader'}.
2021-04-15 20:42:30,457 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Recovered 2 pods from previous attempts, current attempt id is 2.
2021-04-15 20:42:30,458 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 2 workers from previous attempt.
2021-04-15 20:42:30,458 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker tuiwen-flink-taskmanager-1-12 recovered from previous attempt.
2021-04-15 20:42:30,458 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker tuiwen-flink-taskmanager-1-2 recovered from previous attempt.
2021-04-15 20:42:30,458 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector tuiwen-flink-resourcemanager-leader with lock identity 7d4a9b5c-39aa-4103-963b-eaf24ea6435a.
2021-04-15 20:42:30,959 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='tuiwen-flink-resourcemanager-leader'}.
2021-04-15 20:42:30,978 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for tuiwen-flink-resourcemanager-leader.
2021-04-15 23:11:15,866 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:30,626 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:32,438 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:33,325 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:35,948 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:39,387 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:40,336 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.
2021-04-15 23:11:41,485 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@10.20.0.39:6123/user/rpc/dispatcher_1.

pod2 logs:

2021-04-15 20:18:46,969 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a new watch on TaskManager pods.
2021-04-15 20:20:35,979 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap tuiwen-flink-6440d2d65c10d06131376b0420e8adf8-jobmanager-leader.
2021-04-15 20:24:06,209 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap tuiwen-flink-93cf6d866a84cb05815a0f852b3297f1-jobmanager-leader.
2021-04-15 20:31:24,430 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap tuiwen-flink-0d9a578a5689eca5d431814c337afd0e-jobmanager-leader.
2021-04-15 20:33:09,938 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap tuiwen-flink-3f809aa36d8500c185498673430ac0cd-jobmanager-leader.
2021-04-15 20:38:48,594 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap tuiwen-flink-58b7be950819b50c0de022a4b3bcffba-jobmanager-leader.
2021-04-15 20:41:48,424 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager afac3ea69e3abb20d0ddd6be2504479c@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_3 for job d5799c4dfc163b612be8106a44c987f1 from the resource manager.
2021-04-15 20:41:49,231 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink-metrics@10.20.0.39:45555] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-04-15 20:41:49,259 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@10.20.0.39:6123] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-04-15 20:41:49,994 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.20.0.39:6123
2021-04-15 20:41:49,995 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@10.20.0.39:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.20.0.39:6123]] Caused by: [java.net.ConnectException: Connection refused: /10.20.0.39:6123]
2021-04-15 20:42:00,020 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.20.0.39:6123
2021-04-15 20:42:00,022 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@10.20.0.39:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.20.0.39:6123]] Caused by: [java.net.ConnectException: Connection refused: /10.20.0.39:6123]
2021-04-15 20:42:05,699 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for tuiwen-flink-dispatcher-leader.
2021-04-15 20:42:05,700 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Start SessionDispatcherLeaderProcess.
2021-04-15 20:42:05,700 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Recover all persisted job graphs.
....
2021-04-15 20:42:29,990 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - The heartbeat of JobManager with id ddebda7152b763ce20e87c6778030ceb timed out.
2021-04-15 20:42:29,991 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 97716898a46df03bf2a7c794350b4721@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_4 for job 2009312fce0c39e75b68b1d1c32da004 from the resource manager.
2021-04-15 20:42:29,991 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - The heartbeat of JobManager with id 64ff7b6d6a6596a0b669a24629c91d9d timed out.
2021-04-15 20:42:29,991 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 86446db021cfd58474b55f7b99f94896@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_5 for job c2cf28b034ba867b9e7d546592217f75 from the resource manager.
2021-04-15 20:42:29,991 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - The heartbeat of JobManager with id 3b2e6514f91dce5e2bd6deabdfc0dd8d timed out.
2021-04-15 20:42:29,991 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager b368a2f35aba6a003a7a06dda7f94855@akka.tcp://flink@10.20.0.39:6123/user/rpc/jobmanager_2 for job 7b00b2ad7f28b2695f948e0c36c0fedf from the resource manager.
2021-04-15 20:43:09,231 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap tuiwen-flink-e9f42bf5fd2b1ae62510fc2dcf52370d-jobmanager-leader.
2021-04-15 20:43:10,251 WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Exec Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 0 successful ping/pongs)
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546) [flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530) [flink-dist_2.12-1.12.2.jar:1.12.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_282]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_282]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-04-15 20:43:11,254 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Creating a new watch on ConfigMap tuiwen-flink-dispatcher-leader.
2021-04-15 20:43:46,415 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver [] - Creating a new watch on ConfigMap tuiwen-flink-f3118def5ccdc03e95960009069effc0-jobmanager-leader.
2021-04-15 20:44:11,290 WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Exec Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 0 successful ping/pongs)
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546) [flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530) [flink-dist_2.12-1.12.2.jar:1.12.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_282]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_282]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-04-15 20:44:12,294 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Creating a new watch on ConfigMap tuiwen-flink-dispatcher-leader.
....
2021-04-15 20:57:25,490 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Creating a new watch on ConfigMap tuiwen-flink-dispatcher-leader.
2021-04-15 20:58:25,493 WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager [] - Exec Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 0 successful ping/pongs)
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:546) [flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:530) [flink-dist_2.12-1.12.2.jar:1.12.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_282]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_282]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]

Because the logs are very long, some intermediate entries have been omitted.

--
Sent from: http://apache-flink.147419.n8.nabble.com/
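To make the election events in the logs above easier to compare, here is a minimal Python sketch that pairs each pod's own lock identity with the leader reported for the same ConfigMap. The function names (`summarize_elections`, `held_by_other`) and the `POD1_EXCERPT` constant are hypothetical helpers introduced only for illustration; the regular expressions assume the exact message wording seen in the pod1 log lines and may not match other Flink versions. It only restates what the log lines say and is not a diagnosis.

```python
import re

# Patterns assume the message wording that appears in the pod1 logs above.
ELECTED = re.compile(
    r"New leader elected (?P<leader>[0-9a-f-]{36}) for (?P<configmap>[\w-]+)\."
)
IDENTITY = re.compile(
    r"Create KubernetesLeaderElector (?P<configmap>[\w-]+) "
    r"with lock identity (?P<identity>[0-9a-f-]{36})\."
)

# A few lines copied from the pod1 logs (hypothetical constant name).
POD1_EXCERPT = """\
2021-04-15 20:42:26,058 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 7d4a9b5c-39aa-4103-963b-eaf24ea6435a for tuiwen-flink-restserver-leader.
2021-04-15 20:42:26,261 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector tuiwen-flink-dispatcher-leader with lock identity 7d4a9b5c-39aa-4103-963b-eaf24ea6435a.
2021-04-15 20:42:26,660 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for tuiwen-flink-dispatcher-leader.
2021-04-15 20:42:30,458 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector tuiwen-flink-resourcemanager-leader with lock identity 7d4a9b5c-39aa-4103-963b-eaf24ea6435a.
2021-04-15 20:42:30,978 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 6b1aac24-cf40-4aac-bb50-6812290a1f34 for tuiwen-flink-resourcemanager-leader.
"""


def summarize_elections(log_text):
    """Map each ConfigMap to (this pod's lock identity or None, last elected leader)."""
    own_identity = {}
    elected = {}
    for line in log_text.splitlines():
        match = IDENTITY.search(line)
        if match:
            own_identity[match["configmap"]] = match["identity"]
        match = ELECTED.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last election wins.
            elected[match["configmap"]] = match["leader"]
    return {cm: (own_identity.get(cm), leader) for cm, leader in elected.items()}


def held_by_other(summary):
    """ConfigMaps whose elected leader differs from this pod's known lock identity."""
    return sorted(
        cm for cm, (own, leader) in summary.items()
        if own is not None and own != leader
    )


if __name__ == "__main__":
    summary = summarize_elections(POD1_EXCERPT)
    for configmap, (own, leader) in summary.items():
        print(configmap, "own:", own, "elected:", leader)
    print("held by another identity:", held_by_other(summary))
```

Running the same functions over the full pod1 and pod2 log files would show, per ConfigMap, which lock identity each pod registered and which identity was last reported as leader.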