flink在yarn集群上启动的问题

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

flink在yarn集群上启动的问题

tanggen911@163.com
您好,我在向yarn 集群提交flink任务时遇到了一些问题,希望能帮忙回答一下
我布署了一个三个节点hadoop集群,两个工作节点为4c24G,yarn-site中配置了8个vcore,可用内存为20G,总共是16vcore 40G的资源,现在我向yarn提交了两个任务,分别分配了3vcore,6G内存,共消耗6vcore,12G内存,从hadoop的web ui上也能反映这一点,如下图:
但是当我提交第三个任务时,却无法提交成功,没有明显的报错日志,可是整个集群的资源明显是充足的,所以不知道问题是出现在哪里,还请多多指教
附1(控制台输出):
The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Could not deploy Yarn job cluster.
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:699)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:232)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Could not deploy Yarn job cluster.
at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:431)
at org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1818)
at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)
at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1700)
at com.hfhj.akb.report.order.PlatformOrderStream.main(PlatformOrderStream.java:81)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)
... 11 more
Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. 
Diagnostics from YARN: Application application_1618931441017_0004 failed 2 times in previous 10000 milliseconds (global limit =4; local limit is =2) due to AM Container for appattempt_1618931441017_0004_000003 exited with  exitCode: 1
Failing this attempt.Diagnostics: [2021-04-20 23:34:16.067]Exception from container-launch.
Container id: container_1618931441017_0004_03_000001
Exit code: 1

[2021-04-20 23:34:16.069]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

[2021-04-20 23:34:16.069]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :

For more detailed output, check the application tracking page: http://master107:8088/cluster/app/application_1618931441017_0004 Then click on links to logs of each attempt.
. Failing the application.
If log aggregation is enabled on your cluster, use this command to further investigate the issue:
yarn logs -applicationId application_1618931441017_0004
at org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1021)
at org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:524)
at org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:424)
... 22 more
附2(hadoop日志):
2021-04-20 23:31:40,293 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1618931441017_0004_01_000001
2021-04-20 23:31:40,293 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1618931441017_0004
2021-04-20 23:31:41,297 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1618931441017_0004_01_000001]
2021-04-20 23:34:12,558 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1618931441017_0004_000003 (auth:SIMPLE)
2021-04-20 23:34:12,565 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1618931441017_0004_03_000001 by user root
2021-04-20 23:34:12,570 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: TimelineService V2.0 is not enabled. Skipping updating flowContext for application application_1618931441017_0004
2021-04-20 23:34:12,571 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Adding container_1618931441017_0004_03_000001 to application application_1618931441017_0004
2021-04-20 23:34:12,572 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1618931441017_0004_03_000001 transitioned from NEW to LOCALIZING
2021-04-20 23:34:12,572 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1618931441017_0004
2021-04-20 23:34:12,574 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1618931441017_0004_03_000001 transitioned from LOCALIZING to SCHEDULED
2021-04-20 23:34:12,574 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_1618931441017_0004_03_000001]
2021-04-20 23:34:12,600 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1618931441017_0004_03_000001 transitioned from SCHEDULED to RUNNING
2021-04-20 23:34:12,600 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1618931441017_0004_03_000001
2021-04-20 23:34:12,603 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /opt/hadoop/hadoopdata/nm-local-dir/usercache/root/appcache/application_1618931441017_0004/container_1618931441017_0004_03_000001/default_container_executor.sh]
2021-04-20 23:34:12,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1618931441017_0004_03_000001's ip = 10.100.8.108, and hostname = node108
2021-04-20 23:34:12,911 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1618931441017_0004_03_000001 since CPU usage is not yet available.
2021-04-20 23:34:16,067 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1618931441017_0004_03_000001 is : 1
2021-04-20 23:34:16,067 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1618931441017_0004_03_000001 and exit code: 1
ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
        at org.apache.hadoop.util.Shell.run(Shell.java:902)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:294)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:501)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:311)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-04-20 23:34:16,067 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2021-04-20 23:34:16,067 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1618931441017_0004_03_000001
2021-04-20 23:34:16,067 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
2021-04-20 23:34:16,067 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1.
2021-04-20 23:34:16,069 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1618931441017_0004_03_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2021-04-20 23:34:16,069 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1618931441017_0004_03_000001