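Hi all, I deployed Flink 1.12 with HA on K8s using the native K8s mode.
After it had been running for a while, a large number of jobmanager-leader ConfigMaps had been created. Are all of these ConfigMaps actually needed? How does everyone clean them up and maintain them?

-- Sent from: http://apache-flink.147419.n8.nabble.com/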
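Thanks for using the K8s HA mode. Are you running in Session mode or Application mode?

* In Application mode, all HA-related ConfigMaps are cleaned up automatically when the Flink job reaches a terminal state (FAILED, CANCELED, SUCCEED). You can cancel the job from the web UI or with flink cancel and then check; nothing should be left behind.
* In Session mode, if you have submitted many jobs, each job gets its own ConfigMap. The contents of that ConfigMap are cleaned up when the job finishes, but the ConfigMap itself remains. There is already a ticket [1] tracking improvements to the cleanup in Session mode. For now, once you have confirmed the session is no longer in use, you can clean up with:
  kubectl delete cm --selector='app=<ClusterID>,configmap-type=high-availability'

[1] https://issues.apache.org/jira/browse/FLINK-20219

Best,
Yang

tao7 <[hidden email]> wrote on Mon, Dec 28, 2020 at 10:26 AM:
> Hi all, I deployed Flink 1.12 with HA on K8s using the native K8s mode.
> After it had been running for a while, a large number of jobmanager-leader ConfigMaps had been created. Are all of these ConfigMaps actually needed? How does everyone clean them up and maintain them?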
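For a Session cluster, it can help to list the matching ConfigMaps before deleting them. The following is a minimal sketch, not an official Flink tool: it only builds the kubectl argument list from the label selector given in the reply above and prints it. The helper names, the cluster id "my-flink-session", and the namespace "flink" are hypothetical; --namespace and --dry-run=client are standard kubectl options.

```python
import shlex
from typing import List, Optional


def ha_configmap_selector(cluster_id: str) -> str:
    """Label selector quoted in the reply: app=<ClusterID>,configmap-type=high-availability."""
    # Reject ids that would change the meaning of the selector.
    if not cluster_id or any(c in cluster_id for c in " ,='\""):
        raise ValueError(f"suspicious cluster id: {cluster_id!r}")
    return f"app={cluster_id},configmap-type=high-availability"


def kubectl_command(action: str, cluster_id: str,
                    namespace: Optional[str] = None,
                    dry_run: bool = False) -> List[str]:
    """Build (but do not run) a kubectl command targeting the HA ConfigMaps."""
    if action not in ("get", "delete"):
        raise ValueError("action must be 'get' or 'delete'")
    cmd = ["kubectl", action, "cm",
           f"--selector={ha_configmap_selector(cluster_id)}"]
    if namespace:
        cmd += ["--namespace", namespace]
    if dry_run and action == "delete":
        # Ask kubectl to report what would be deleted without deleting it.
        cmd.append("--dry-run=client")
    return cmd


if __name__ == "__main__":
    # Hypothetical cluster id and namespace, for illustration only.
    for action in ("get", "delete"):
        cmd = kubectl_command(action, "my-flink-session",
                              namespace="flink", dry_run=True)
        print(" ".join(shlex.quote(part) for part in cmd))
```

Running the printed get command first shows which ConfigMaps carry the HA labels; the delete should only be run once the session is confirmed to be unused, as noted above.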
In reply to this post by tao7
Hello, I have just started using Flink 1.12.1 with HA on K8s and found that the jobmanager restarts roughly every half hour, always with this error. Have you run into this? Thanks!

2021-01-17 04:52:12,399 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Suspending SlotPool.
2021-01-17 04:52:12,399 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 28ed7c84e7f395c5a34880df91b251c6: Stopping JobMaster for job p_port_traffic_5m@hive->mysql @2021-01-17 11:40:00(67fb9b15d0deff998e287aa7e2cf1c7b)..
2021-01-17 04:52:12,399 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] - Stopping SlotPool.
2021-01-17 04:52:12,399 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_32 for job 67fb9b15d0deff998e287aa7e2cf1c7b from the resource manager.
2021-01-17 04:52:12,399 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Stopping DefaultLeaderElectionService.
2021-01-17 04:52:12,399 INFO org.apache.flink.kubernetes.highavailability.KubernetesLeaderElectionDriver [] - Closing KubernetesLeaderElectionDriver{configMapName='test-flink-etl-67fb9b15d0deff998e287aa7e2cf1c7b-jobmanager-leader'}.
2021-01-17 04:52:12,399 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapWatcher [] - The watcher is closing.
2021-01-17 04:52:12,416 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Removed job graph 67fb9b15d0deff998e287aa7e2cf1c7b from KubernetesStateHandleStore{configMapName='test-flink-etl-dispatcher-leader'}.
2021-01-17 04:52:30,686 ERROR org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Fatal error occurred in ResourceManager.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap test-flink-etl-12c0ac13184d3d98af71dadbc4a81d03-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-17 04:52:30,691 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap test-flink-etl-12c0ac13184d3d98af71dadbc4a81d03-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-17 04:52:30,693 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
It looks like the error occurred during the watch. What is your K8s environment like? An unstable network between the Pod and the K8s APIServer can cause this problem.

I have tested on both minikube and an Alibaba Cloud ACK cluster, and in long runs (more than a week) I did not see JM restarts caused by things like "too old resource version". Since several people have reported this problem, it will be fixed in the next 1.12 bug-fix release (1.12.2).

Best,
Yang

macdoor <[hidden email]> wrote on Mon, Jan 18, 2021 at 9:45 AM:
> Hello, I have just started using Flink 1.12.1 with HA on K8s and found that the jobmanager restarts roughly every half hour, always with this error. Have you run into this? Thanks!