大家好
目前环境是flink 1.7.2,使用YARN Session模式提交任务,Hadoop 版本2.7.3,hdfs namenode配置了ha模式,提交任务的时候报以下错误,系统环境变量中已经设置了HADOOP_HOME,YARN_CONF_DIR,HADOOP_CONF_DIR,HADOOP_CLASSPATH,在flink_conf.yaml中配置了fs.hdfs.hadoopconf 2020-04-10 19:12:02,908 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@trusfortpoc1:23584/user/resourcemanager(00000000000000000000000000000000) 2020-04-10 19:12:02,909 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] 2020-04-10 19:12:02,911 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration 2020-04-10 19:12:02,911 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms) 2020-04-10 19:12:02,912 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] 2020-04-10 19:12:02,913 INFO org.apache.flink.yarn.YarnResourceManager - Registering job manager [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 for job 24691b33c18d7ad73b1f52edb3d68ae4. 2020-04-10 19:12:02,917 INFO org.apache.flink.yarn.YarnResourceManager - Registered job manager [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 for job 24691b33c18d7ad73b1f52edb3d68ae4. 2020-04-10 19:12:02,919 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000. 2020-04-10 19:12:02,919 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. 2020-04-10 19:12:02,920 INFO org.apache.flink.yarn.YarnResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id AllocationID{5a12237c7f2bd8b1cc760ddcbab5a1c0}. 2020-04-10 19:12:02,921 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. 2020-04-10 19:12:02,924 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:4096, vCores:6>. Number pending requests 1. 2020-04-10 19:12:02,926 INFO org.apache.flink.yarn.YarnResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id AllocationID{37dd666a18040bf63ffbf2e022b2ea9b}. 2020-04-10 19:12:06,531 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : trusfortpoc3:35206 2020-04-10 19:12:06,543 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_1586426824930_0006_01_000002 - Remaining pending container requests: 1 2020-04-10 19:12:06,543 INFO org.apache.flink.yarn.YarnResourceManager - Removing container request Capability[<memory:4096, vCores:6>]Priority[1]. Pending container requests 0. 2020-04-10 19:12:06,568 ERROR org.apache.flink.yarn.YarnResourceManager - Could not start TaskManager in container container_1586426824930_0006_01_000002. java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfsClusterForML at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:320) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:687) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) at org.apache.flink.yarn.Utils.createTaskExecutorContext(Utils.java:453) at org.apache.flink.yarn.YarnResourceManager.createTaskExecutorLaunchContext(YarnResourceManager.java:555) at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersAllocated$1(YarnResourceManager.java:390) at org.apache.flink.yarn.YarnResourceManager$$Lambda$183/1182651376.run(Unknown Source) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.net.UnknownHostException: hdfsClusterForML ... 33 more 这个hdfsClusterForML是namenode ha 的nameservice,经过分析是没加载hdfs-site.xml配置导致的, 也尝试过把Hadoop的几个配置文件放到flink 的conf目录下但都无效,最终通过改YarnResourceManager源码后能够正常提交任务。 public YarnResourceManager( RpcService rpcService, String resourceManagerEndpointId, ResourceID resourceId, Configuration flinkConfig, Map<String, String> env, HighAvailabilityServices highAvailabilityServices, HeartbeatServices heartbeatServices, SlotManager slotManager, MetricRegistry metricRegistry, JobLeaderIdService jobLeaderIdService, ClusterInformation clusterInformation, FatalErrorHandler fatalErrorHandler, @Nullable String webInterfaceUrl, JobManagerMetricGroup jobManagerMetricGroup) { super( rpcService, resourceManagerEndpointId, resourceId, highAvailabilityServices, heartbeatServices, slotManager, metricRegistry, jobLeaderIdService, clusterInformation, fatalErrorHandler, jobManagerMetricGroup); this.flinkConfig = flinkConfig; this.yarnConfig = new YarnConfiguration(HadoopUtils.getHadoopConfiguration(flinkConfig)); 但我认为这肯定不是解决问题的方法,所以向大家求助,是不是我忽略什么。 |
Flink需要设置hadoop相关conf位置的环境变量 YARN_CONF_DIR or HADOOP_CONF_DIR [1]
[1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/yarn_setup.html Best, Yangze Guo On Mon, Apr 13, 2020 at 10:52 PM Chief <[hidden email]> wrote: > > 大家好 > 目前环境是flink 1.7.2,使用YARN Session模式提交任务,Hadoop 版本2.7.3,hdfs namenode配置了ha模式,提交任务的时候报以下错误,系统环境变量中已经设置了HADOOP_HOME,YARN_CONF_DIR,HADOOP_CONF_DIR,HADOOP_CLASSPATH,在flink_conf.yaml中配置了fs.hdfs.hadoopconf > > > 2020-04-10 19:12:02,908 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@trusfortpoc1:23584/user/resourcemanager(00000000000000000000000000000000) > 2020-04-10 19:12:02,909 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] > 2020-04-10 19:12:02,911 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration > 2020-04-10 19:12:02,911 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms) > 2020-04-10 19:12:02,912 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] > 2020-04-10 19:12:02,913 INFO org.apache.flink.yarn.YarnResourceManager - Registering job manager [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 for job 24691b33c18d7ad73b1f52edb3d68ae4. > 2020-04-10 19:12:02,917 INFO org.apache.flink.yarn.YarnResourceManager - Registered job manager [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 for job 24691b33c18d7ad73b1f52edb3d68ae4. > 2020-04-10 19:12:02,919 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000. > 2020-04-10 19:12:02,919 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. > 2020-04-10 19:12:02,920 INFO org.apache.flink.yarn.YarnResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id AllocationID{5a12237c7f2bd8b1cc760ddcbab5a1c0}. > 2020-04-10 19:12:02,921 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. > 2020-04-10 19:12:02,924 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:4096, vCores:6>. Number pending requests 1. > 2020-04-10 19:12:02,926 INFO org.apache.flink.yarn.YarnResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id AllocationID{37dd666a18040bf63ffbf2e022b2ea9b}. > 2020-04-10 19:12:06,531 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : trusfortpoc3:35206 > 2020-04-10 19:12:06,543 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_1586426824930_0006_01_000002 - Remaining pending container requests: 1 > 2020-04-10 19:12:06,543 INFO org.apache.flink.yarn.YarnResourceManager - Removing container request Capability[<memory:4096, vCores:6>]Priority[1]. Pending container requests 0. > 2020-04-10 19:12:06,568 ERROR org.apache.flink.yarn.YarnResourceManager - Could not start TaskManager in container container_1586426824930_0006_01_000002. > java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfsClusterForML > at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378) > at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:320) > at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176) > at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:687) > at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628) > at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at org.apache.flink.yarn.Utils.createTaskExecutorContext(Utils.java:453) > at org.apache.flink.yarn.YarnResourceManager.createTaskExecutorLaunchContext(YarnResourceManager.java:555) > at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersAllocated$1(YarnResourceManager.java:390) > at org.apache.flink.yarn.YarnResourceManager$$Lambda$183/1182651376.run(Unknown Source) > at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) > at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: java.net.UnknownHostException: hdfsClusterForML > ... 33 more > > > 这个hdfsClusterForML是namenode ha 的nameservice,经过分析是没加载hdfs-site.xml配置导致的, > 也尝试过把Hadoop的几个配置文件放到flink 的conf目录下但都无效,最终通过改YarnResourceManager源码后能够正常提交任务。 > public YarnResourceManager( > RpcService rpcService, > String resourceManagerEndpointId, > ResourceID resourceId, > Configuration flinkConfig, > Map<String, String> env, > HighAvailabilityServices highAvailabilityServices, > HeartbeatServices heartbeatServices, > SlotManager slotManager, > MetricRegistry metricRegistry, > JobLeaderIdService jobLeaderIdService, > ClusterInformation clusterInformation, > FatalErrorHandler fatalErrorHandler, > @Nullable String webInterfaceUrl, > JobManagerMetricGroup jobManagerMetricGroup) { > super( > rpcService, > resourceManagerEndpointId, > resourceId, > highAvailabilityServices, > heartbeatServices, > slotManager, > metricRegistry, > jobLeaderIdService, > clusterInformation, > fatalErrorHandler, > jobManagerMetricGroup); > this.flinkConfig = flinkConfig; > this.yarnConfig = new YarnConfiguration(HadoopUtils.getHadoopConfiguration(flinkConfig)); > 但我认为这肯定不是解决问题的方法,所以向大家求助,是不是我忽略什么。 |
hi Yangze Guo
您说的环境变量已经在当前用户的环境变量文件里面设置了,您可以看看我的问题描述,现在如果checkpoint的路径设置不是namenode ha的nameservice就不会报错,checkpoint都正常。 ------------------ 原始邮件 ------------------ 发件人: "Yangze Guo"<[hidden email]>; 发送时间: 2020年4月15日(星期三) 下午3:00 收件人: "user-zh"<[hidden email]>; 主题: Re: flink 1.7.2 YARN Session模式提交任务问题求助 Flink需要设置hadoop相关conf位置的环境变量 YARN_CONF_DIR or HADOOP_CONF_DIR [1] [1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/yarn_setup.html Best, Yangze Guo On Mon, Apr 13, 2020 at 10:52 PM Chief <[hidden email]> wrote: > > 大家好 > 目前环境是flink 1.7.2,使用YARN Session模式提交任务,Hadoop 版本2.7.3,hdfs namenode配置了ha模式,提交任务的时候报以下错误,系统环境变量中已经设置了HADOOP_HOME,YARN_CONF_DIR,HADOOP_CONF_DIR,HADOOP_CLASSPATH,在flink_conf.yaml中配置了fs.hdfs.hadoopconf > > > 2020-04-10 19:12:02,908 INFO&nbsp; org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Connecting to ResourceManager akka.tcp://flink@trusfortpoc1:23584/user/resourcemanager(00000000000000000000000000000000) > 2020-04-10 19:12:02,909 INFO&nbsp; org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] > 2020-04-10 19:12:02,911 INFO&nbsp; org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Resolved ResourceManager address, beginning registration > 2020-04-10 19:12:02,911 INFO&nbsp; org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Registration at ResourceManager attempt 1 (timeout=100ms) > 2020-04-10 19:12:02,912 INFO&nbsp; org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] > 2020-04-10 19:12:02,913 INFO&nbsp; org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Registering job manager [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 for job 24691b33c18d7ad73b1f52edb3d68ae4. > 2020-04-10 19:12:02,917 INFO&nbsp; org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Registered job manager [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 for job 24691b33c18d7ad73b1f52edb3d68ae4. > 2020-04-10 19:12:02,919 INFO&nbsp; org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000. > 2020-04-10 19:12:02,919 INFO&nbsp; org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Requesting new slot [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. > 2020-04-10 19:12:02,920 INFO&nbsp; org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id AllocationID{5a12237c7f2bd8b1cc760ddcbab5a1c0}. > 2020-04-10 19:12:02,921 INFO&nbsp; org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; - Requesting new slot [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. > 2020-04-10 19:12:02,924 INFO&nbsp; org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Requesting new TaskExecutor container with resources <memory:4096, vCores:6&gt;. Number pending requests 1. > 2020-04-10 19:12:02,926 INFO&nbsp; org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id AllocationID{37dd666a18040bf63ffbf2e022b2ea9b}. > 2020-04-10 19:12:06,531 INFO&nbsp; org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Received new token for : trusfortpoc3:35206 > 2020-04-10 19:12:06,543 INFO&nbsp; org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Received new container: container_1586426824930_0006_01_000002 - Remaining pending container requests: 1 > 2020-04-10 19:12:06,543 INFO&nbsp; org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Removing container request Capability[<memory:4096, vCores:6&gt;]Priority[1]. Pending container requests 0. > 2020-04-10 19:12:06,568 ERROR org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;- Could not start TaskManager in container container_1586426824930_0006_01_000002. > java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfsClusterForML > at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378) > at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:320) > at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176) > at org.apache.hadoop.hdfs.DFSClient.<init&gt;(DFSClient.java:687) > at org.apache.hadoop.hdfs.DFSClient.<init&gt;(DFSClient.java:628) > at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at org.apache.flink.yarn.Utils.createTaskExecutorContext(Utils.java:453) > at org.apache.flink.yarn.YarnResourceManager.createTaskExecutorLaunchContext(YarnResourceManager.java:555) > at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersAllocated$1(YarnResourceManager.java:390) > at org.apache.flink.yarn.YarnResourceManager$$Lambda$183/1182651376.run(Unknown Source) > at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) > at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: java.net.UnknownHostException: hdfsClusterForML > ... 33 more > > > 这个hdfsClusterForML是namenode ha 的nameservice,经过分析是没加载hdfs-site.xml配置导致的, > 也尝试过把Hadoop的几个配置文件放到flink 的conf目录下但都无效,最终通过改YarnResourceManager源码后能够正常提交任务。 > public YarnResourceManager( > &nbsp; &nbsp; &nbsp; RpcService rpcService, > &nbsp; &nbsp; &nbsp; String resourceManagerEndpointId, > &nbsp; &nbsp; &nbsp; ResourceID resourceId, > &nbsp; &nbsp; &nbsp; Configuration flinkConfig, > &nbsp; &nbsp; &nbsp; Map<String, String&gt; env, > &nbsp; &nbsp; &nbsp; HighAvailabilityServices highAvailabilityServices, > &nbsp; &nbsp; &nbsp; HeartbeatServices heartbeatServices, > &nbsp; &nbsp; &nbsp; SlotManager slotManager, > &nbsp; &nbsp; &nbsp; MetricRegistry metricRegistry, > &nbsp; &nbsp; &nbsp; JobLeaderIdService jobLeaderIdService, > &nbsp; &nbsp; &nbsp; ClusterInformation clusterInformation, > &nbsp; &nbsp; &nbsp; FatalErrorHandler fatalErrorHandler, > &nbsp; &nbsp; &nbsp; @Nullable String webInterfaceUrl, > &nbsp; &nbsp; &nbsp; JobManagerMetricGroup jobManagerMetricGroup) { > &nbsp; &nbsp;super( > &nbsp; &nbsp; &nbsp; rpcService, > &nbsp; &nbsp; &nbsp; resourceManagerEndpointId, > &nbsp; &nbsp; &nbsp; resourceId, > &nbsp; &nbsp; &nbsp; highAvailabilityServices, > &nbsp; &nbsp; &nbsp; heartbeatServices, > &nbsp; &nbsp; &nbsp; slotManager, > &nbsp; &nbsp; &nbsp; metricRegistry, > &nbsp; &nbsp; &nbsp; jobLeaderIdService, > &nbsp; &nbsp; &nbsp; clusterInformation, > &nbsp; &nbsp; &nbsp; fatalErrorHandler, > &nbsp; &nbsp; &nbsp; jobManagerMetricGroup); > &nbsp; &nbsp;this.flinkConfig&nbsp; = flinkConfig; > &nbsp; &nbsp;this.yarnConfig = new YarnConfiguration(HadoopUtils.getHadoopConfiguration(flinkConfig)); > 但我认为这肯定不是解决问题的方法,所以向大家求助,是不是我忽略什么。 |
注意环境变量和 fs.hdfs.hdfsdefault 要配置成 HDFS 路径或 YARN
集群已知的本地路径,不要配置成客户端的路径。因为实际起作用是在拉起 TM 的那台机器上解析拉取的。 Best, tison. Chief <[hidden email]> 于2020年4月15日周三 下午7:40写道: > hi Yangze Guo > 您说的环境变量已经在当前用户的环境变量文件里面设置了,您可以看看我的问题描述,现在如果checkpoint的路径设置不是namenode > ha的nameservice就不会报错,checkpoint都正常。 > > > > > ------------------ 原始邮件 ------------------ > 发件人: "Yangze Guo"<[hidden email]>; > 发送时间: 2020年4月15日(星期三) 下午3:00 > 收件人: "user-zh"<[hidden email]>; > > 主题: Re: flink 1.7.2 YARN Session模式提交任务问题求助 > > > > Flink需要设置hadoop相关conf位置的环境变量 YARN_CONF_DIR or HADOOP_CONF_DIR [1] > > [1] > https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/yarn_setup.html > > Best, > Yangze Guo > > On Mon, Apr 13, 2020 at 10:52 PM Chief <[hidden email]> wrote: > > > > 大家好 > > 目前环境是flink 1.7.2,使用YARN Session模式提交任务,Hadoop 版本2.7.3,hdfs > namenode配置了ha模式,提交任务的时候报以下错误,系统环境变量中已经设置了HADOOP_HOME,YARN_CONF_DIR,HADOOP_CONF_DIR,HADOOP_CLASSPATH,在flink_conf.yaml中配置了fs.hdfs.hadoopconf > > > > > > 2020-04-10 19:12:02,908 INFO&nbsp; > org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; - Connecting to ResourceManager akka.tcp://flink@trusfortpoc1 > :23584/user/resourcemanager(00000000000000000000000000000000) > > 2020-04-10 19:12:02,909 INFO&nbsp; > org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; - Cannot serve slot request, no > ResourceManager connected. Adding as pending request > [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] > > 2020-04-10 19:12:02,911 INFO&nbsp; > org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; - Resolved ResourceManager address, beginning registration > > 2020-04-10 19:12:02,911 INFO&nbsp; > org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; - Registration at ResourceManager attempt 1 (timeout=100ms) > > 2020-04-10 19:12:02,912 INFO&nbsp; > org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; - Cannot serve slot request, no > ResourceManager connected. Adding as pending request > [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] > > 2020-04-10 19:12:02,913 INFO&nbsp; > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Registering job manager > [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 > for job 24691b33c18d7ad73b1f52edb3d68ae4. > > 2020-04-10 19:12:02,917 INFO&nbsp; > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Registered job manager > [hidden email]://flink@trusfortpoc1:23584/user/jobmanager_0 > for job 24691b33c18d7ad73b1f52edb3d68ae4. > > 2020-04-10 19:12:02,919 INFO&nbsp; > org.apache.flink.runtime.jobmaster.JobMaster&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; - JobManager successfully registered at ResourceManager, leader > id: 00000000000000000000000000000000. > > 2020-04-10 19:12:02,919 INFO&nbsp; > org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; - Requesting new slot > [SlotRequestId{35ad2384e9cd0efd30b43f5302db24b6}] and profile > ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, > nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. > > 2020-04-10 19:12:02,920 INFO&nbsp; > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Request slot with profile > ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, > nativeMemoryInMB=0, networkMemoryInMB=0} for job > 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id > AllocationID{5a12237c7f2bd8b1cc760ddcbab5a1c0}. > > 2020-04-10 19:12:02,921 INFO&nbsp; > org.apache.flink.runtime.jobmaster.slotpool.SlotPool&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; - Requesting new slot > [SlotRequestId{0feacbb4fe16c8c7a70249f1396565d0}] and profile > ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, > nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager. > > 2020-04-10 19:12:02,924 INFO&nbsp; > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Requesting new TaskExecutor container with resources > <memory:4096, vCores:6&gt;. Number pending requests 1. > > 2020-04-10 19:12:02,926 INFO&nbsp; > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Request slot with profile > ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, > nativeMemoryInMB=0, networkMemoryInMB=0} for job > 24691b33c18d7ad73b1f52edb3d68ae4 with allocation id > AllocationID{37dd666a18040bf63ffbf2e022b2ea9b}. > > 2020-04-10 19:12:06,531 INFO&nbsp; > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl&nbsp; &nbsp; > &nbsp; &nbsp; &nbsp;- Received new token for : > trusfortpoc3:35206 > > 2020-04-10 19:12:06,543 INFO&nbsp; > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Received new container: > container_1586426824930_0006_01_000002 - Remaining pending container > requests: 1 > > 2020-04-10 19:12:06,543 INFO&nbsp; > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Removing container request Capability[<memory:4096, > vCores:6&gt;]Priority[1]. Pending container requests 0. > > 2020-04-10 19:12:06,568 ERROR > org.apache.flink.yarn.YarnResourceManager&nbsp; &nbsp; &nbsp; > &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; > &nbsp; &nbsp;- Could not start TaskManager in container > container_1586426824930_0006_01_000002. > > java.lang.IllegalArgumentException: java.net.UnknownHostException: > hdfsClusterForML > > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378) > > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:320) > > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176) > > at > org.apache.hadoop.hdfs.DFSClient.<init&gt;(DFSClient.java:687) > > at > org.apache.hadoop.hdfs.DFSClient.<init&gt;(DFSClient.java:628) > > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149) > > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667) > > at > org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93) > > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701) > > at > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683) > > at > org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372) > > at > org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > > at > org.apache.flink.yarn.Utils.createTaskExecutorContext(Utils.java:453) > > at > org.apache.flink.yarn.YarnResourceManager.createTaskExecutorLaunchContext(YarnResourceManager.java:555) > > at > org.apache.flink.yarn.YarnResourceManager.lambda$onContainersAllocated$1(YarnResourceManager.java:390) > > at > org.apache.flink.yarn.YarnResourceManager$$Lambda$183/1182651376.run(Unknown > Source) > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70) > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > > at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > > at > akka.actor.Actor$class.aroundReceive(Actor.scala:502) > > at > akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > > at > akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > > at > akka.actor.ActorCell.invoke(ActorCell.scala:495) > > at > akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > > at > akka.dispatch.Mailbox.run(Mailbox.scala:224) > > at > akka.dispatch.Mailbox.exec(Mailbox.scala:234) > > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > > Caused by: java.net.UnknownHostException: hdfsClusterForML > > ... 33 more > > > > > > 这个hdfsClusterForML是namenode ha > 的nameservice,经过分析是没加载hdfs-site.xml配置导致的, > > 也尝试过把Hadoop的几个配置文件放到flink > 的conf目录下但都无效,最终通过改YarnResourceManager源码后能够正常提交任务。 > > public YarnResourceManager( > > &nbsp; &nbsp; &nbsp; RpcService rpcService, > > &nbsp; &nbsp; &nbsp; String resourceManagerEndpointId, > > &nbsp; &nbsp; &nbsp; ResourceID resourceId, > > &nbsp; &nbsp; &nbsp; Configuration flinkConfig, > > &nbsp; &nbsp; &nbsp; Map<String, String&gt; env, > > &nbsp; &nbsp; &nbsp; HighAvailabilityServices > highAvailabilityServices, > > &nbsp; &nbsp; &nbsp; HeartbeatServices heartbeatServices, > > &nbsp; &nbsp; &nbsp; SlotManager slotManager, > > &nbsp; &nbsp; &nbsp; MetricRegistry metricRegistry, > > &nbsp; &nbsp; &nbsp; JobLeaderIdService > jobLeaderIdService, > > &nbsp; &nbsp; &nbsp; ClusterInformation > clusterInformation, > > &nbsp; &nbsp; &nbsp; FatalErrorHandler fatalErrorHandler, > > &nbsp; &nbsp; &nbsp; @Nullable String webInterfaceUrl, > > &nbsp; &nbsp; &nbsp; JobManagerMetricGroup > jobManagerMetricGroup) { > > &nbsp; &nbsp;super( > > &nbsp; &nbsp; &nbsp; rpcService, > > &nbsp; &nbsp; &nbsp; resourceManagerEndpointId, > > &nbsp; &nbsp; &nbsp; resourceId, > > &nbsp; &nbsp; &nbsp; highAvailabilityServices, > > &nbsp; &nbsp; &nbsp; heartbeatServices, > > &nbsp; &nbsp; &nbsp; slotManager, > > &nbsp; &nbsp; &nbsp; metricRegistry, > > &nbsp; &nbsp; &nbsp; jobLeaderIdService, > > &nbsp; &nbsp; &nbsp; clusterInformation, > > &nbsp; &nbsp; &nbsp; fatalErrorHandler, > > &nbsp; &nbsp; &nbsp; jobManagerMetricGroup); > > &nbsp; &nbsp;this.flinkConfig&nbsp; = flinkConfig; > > &nbsp; &nbsp;this.yarnConfig = new > YarnConfiguration(HadoopUtils.getHadoopConfiguration(flinkConfig)); > > 但我认为这肯定不是解决问题的方法,所以向大家求助,是不是我忽略什么。 |
Free forum by Nabble | Edit this page |