flink kubernetes application: TaskManager restarts frequently


flink kubernetes application: TaskManager restarts frequently

casel.chen
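While trying out flink kubernetes application recently, I found that TMs are repeatedly requested and then terminated. The Rest service, which is configured with type LoadBalancer, also never becomes ready, so I cannot open the flink web ui. The k8s logs are below. What is the cause? Is it because the resources I requested are too small?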


================= Startup parameters
"kubernetes.jobmanager.cpu": "0.1",
"kubernetes.taskmanager.cpu": "0.1",
"taskmanager.numberOfTaskSlots": "1",
"jobmanager.memory.process.size": "1024m",
"taskmanager.memory.process.size": "1024m",


================= k8s logs



2021-04-05 09:55:14,777 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - JobManager successfully registered at ResourceManager, leader id: 9903e058fb5ca6f418c78dafcad048f1.
2021-04-05 09:55:14,869 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Registered job manager [hidden email]://flink@172.17.0.5:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-04-05 09:55:14,869 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Registered job manager [hidden email]://flink@172.17.0.5:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-04-05 09:55:14,870 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{3bcf44c03f742d211b5abcc9d0d35068}] and profile ResourceProfile{UNKNOWN} with allocation id 17bcd11a1d493155e3ed45cfd200be79 from resource manager.
2021-04-05 09:55:14,871 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Registered job manager [hidden email]://flink@172.17.0.5:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-04-05 09:55:14,871 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id 17bcd11a1d493155e3ed45cfd200be79.
2021-04-05 09:55:14,974 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)}, current pending count: 1.
2021-04-05 09:55:15,272 INFO  org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2021-04-05 09:55:18,570 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating new TaskManager pod with name flink-k8s-native-application-cluster-taskmanager-1-1 and resource <1024,0.1>.
2021-04-05 09:55:22,669 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod flink-k8s-native-application-cluster-taskmanager-1-1 is created.
2021-04-05 09:55:22,670 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: flink-k8s-native-application-cluster-taskmanager-1-1
2021-04-05 09:55:22,770 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker flink-k8s-native-application-cluster-taskmanager-1-1 with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)}.
2021-04-05 09:56:35,494 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker flink-k8s-native-application-cluster-taskmanager-1-1 with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)} was requested in current attempt and has not registered. Current pending count after removing: 0.
2021-04-05 09:56:35,494 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker flink-k8s-native-application-cluster-taskmanager-1-1 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-task-manager(exitCode=1, reason=Error, message=null)]
2021-04-05 09:56:35,495 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)}, current pending count: 1.
2021-04-05 09:56:35,496 INFO  org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2021-04-05 09:56:35,498 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating new TaskManager pod with name flink-k8s-native-application-cluster-taskmanager-1-2 and resource <1024,0.1>.
2021-04-05 09:56:35,700 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod flink-k8s-native-application-cluster-taskmanager-1-2 is created.
2021-04-05 09:56:35,811 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: flink-k8s-native-application-cluster-taskmanager-1-2
2021-04-05 09:56:35,811 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker flink-k8s-native-application-cluster-taskmanager-1-2 with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)}.
2021-04-05 09:57:56,904 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker flink-k8s-native-application-cluster-taskmanager-1-2 with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)} was requested in current attempt and has not registered. Current pending count after removing: 0.
2021-04-05 09:57:56,997 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker flink-k8s-native-application-cluster-taskmanager-1-2 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-task-manager(exitCode=1, reason=Error, message=null)]
2021-04-05 09:57:56,998 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)}, current pending count: 1.
2021-04-05 09:57:57,099 INFO  org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2021-04-05 09:57:57,199 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating new TaskManager pod with name flink-k8s-native-application-cluster-taskmanager-1-3 and resource <1024,0.1>.
2021-04-05 09:57:57,800 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod flink-k8s-native-application-cluster-taskmanager-1-3 is created.
2021-04-05 09:57:58,197 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: flink-k8s-native-application-cluster-taskmanager-1-3
2021-04-05 09:57:58,198 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker flink-k8s-native-application-cluster-taskmanager-1-3 with resource spec WorkerResourceSpec {cpuCores=0.1, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes)}.
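
The WorkerResourceSpec sizes in the log above can be reproduced from taskmanager.memory.process.size = 1024m. The following is a minimal sketch, not Flink's own code: the function name is hypothetical, and every default value in it (metaspace 256m, JVM overhead fraction 0.1 with min 192m, managed fraction 0.4, network fraction 0.1 with min 64m, framework heap and off-heap 128m each) is an assumption based on the documented defaults of the Flink memory model around version 1.12.

```python
# Sketch: derive the TaskManager memory breakdown from the process size.
# All default values are assumptions (documented Flink defaults), in megabytes.


def clamp(value, low, high):
    """Limit a fraction-derived size to its configured min/max."""
    return max(low, min(value, high))


def taskmanager_breakdown(process_mb):
    metaspace = 256.0                                   # taskmanager.memory.jvm-metaspace.size
    overhead = clamp(0.1 * process_mb, 192.0, 1024.0)   # taskmanager.memory.jvm-overhead.*
    total_flink = process_mb - metaspace - overhead

    managed = 0.4 * total_flink                         # taskmanager.memory.managed.fraction
    network = clamp(0.1 * total_flink, 64.0, 1024.0)    # taskmanager.memory.network.*
    framework_heap = 128.0                              # taskmanager.memory.framework.heap.size
    framework_off_heap = 128.0                          # taskmanager.memory.framework.off-heap.size
    task_off_heap = 0.0                                 # taskmanager.memory.task.off-heap.size

    # Task heap is whatever remains of the total Flink memory.
    task_heap = (total_flink - framework_heap - framework_off_heap
                 - task_off_heap - network - managed)
    return {
        "total_flink": total_flink,
        "task_heap": task_heap,
        "task_off_heap": task_off_heap,
        "network": network,
        "managed": managed,
    }


if __name__ == "__main__":
    for name, mb in taskmanager_breakdown(1024).items():
        print(f"{name}: {mb:.1f} mb")
```

Under these assumptions a 1024m process leaves 576m of total Flink memory, of which only about 25.6m ends up as task heap, in line with taskHeapSize, networkMemSize and managedMemSize in the log.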

Re: flink kubernetes application: TaskManager restarts frequently

Yang Wang
Your cpu setting is very small, and K8s enforces it strictly.
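
To give a feel for how strict a 0.1 cpu limit is, here is a rough sketch. It assumes the container's CPU limit equals kubernetes.taskmanager.cpu and that the node uses the default CFS period of 100 ms; the function names and the 8 CPU-seconds startup cost are hypothetical, chosen only for illustration.

```python
# Sketch: what a fractional CPU limit means under CFS bandwidth control.
# Assumption: the node uses the default CFS period of 100 ms.
CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_limit, period_us=CFS_PERIOD_US):
    """CPU time (microseconds) the container may use per CFS period."""
    return int(round(cpu_limit * period_us))


def min_wall_clock_seconds(cpu_seconds_needed, cpu_limit):
    """Lower bound on wall-clock time for work that needs this much CPU time."""
    return cpu_seconds_needed / cpu_limit


if __name__ == "__main__":
    limit = 0.1  # kubernetes.taskmanager.cpu from the startup parameters
    print("quota per 100 ms period (us):", cfs_quota_us(limit))
    # Hypothetical: a JVM start that needs 8 CPU-seconds in total.
    print("minimum startup time (s):", min_wall_clock_seconds(8.0, limit))
```

With a 0.1 limit the container gets about 10 ms of CPU time in every 100 ms window, so any CPU-bound startup work is stretched roughly tenfold compared with a full core.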

I suspect the TM starts very slowly, never manages to register, and fails on a timeout. You can check the TM log to confirm.
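
How long each TM pod survived can be read off the JobManager log. The sketch below is only an illustration: the helper name is hypothetical, the two regular expressions are assumptions derived from the message wording in the posted log, and the sample lines are abbreviated copies of that log (logger name and resource spec shortened).

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
# Patterns assumed from the wording of the posted log lines.
REQUESTED = re.compile(r" - Requested worker (\S+) with resource spec")
TERMINATED = re.compile(r" - Worker (\S+) is terminated")


def pod_lifetimes(lines):
    """Map pod name -> seconds between 'Requested worker' and 'is terminated'."""
    requested_at = {}
    lifetimes = {}
    for line in lines:
        try:
            ts = datetime.strptime(line[:23], TS_FORMAT)
        except ValueError:
            continue  # not a log line starting with a timestamp
        match = REQUESTED.search(line)
        if match:
            requested_at[match.group(1)] = ts
            continue
        match = TERMINATED.search(line)
        if match and match.group(1) in requested_at:
            delta = ts - requested_at[match.group(1)]
            lifetimes[match.group(1)] = delta.total_seconds()
    return lifetimes


# Abbreviated lines taken from the log in the original post.
SAMPLE = [
    "2021-04-05 09:55:22,770 INFO  ActiveResourceManager [] - Requested worker flink-k8s-native-application-cluster-taskmanager-1-1 with resource spec WorkerResourceSpec {...}.",
    "2021-04-05 09:56:35,494 INFO  ActiveResourceManager [] - Worker flink-k8s-native-application-cluster-taskmanager-1-1 is terminated. Diagnostics: Pod terminated",
    "2021-04-05 09:56:35,811 INFO  ActiveResourceManager [] - Requested worker flink-k8s-native-application-cluster-taskmanager-1-2 with resource spec WorkerResourceSpec {...}.",
    "2021-04-05 09:57:56,997 INFO  ActiveResourceManager [] - Worker flink-k8s-native-application-cluster-taskmanager-1-2 is terminated. Diagnostics: Pod terminated",
]

if __name__ == "__main__":
    for pod, seconds in pod_lifetimes(SAMPLE).items():
        print(f"{pod}: {seconds:.3f} s")
```

For the posted log this gives roughly 73 s for the first pod and 81 s for the second before each exited with exitCode=1. That is only the observed lifetime; the actual failure reason still has to come from the TM log.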

Also, judging from the log you posted, the rest endpoint should already have started successfully, and you can access it via <LoadBalancerIP:8081>.

Best,
Yang
