Has anyone run into a timeout exception when manually triggering a Flink savepoint from the CLI?
Cancelling the job with a savepoint ("cancel with savepoint") fails with the same timeout error. The savepoint is already configured to be written to HDFS, and Flink itself runs on YARN. The official docs mention a parameter "akka.client.timeout" that may apply here, but it only takes effect when set in flink-conf.yaml and there is no way to pass it through the CLI. So the job cannot be cancelled and the Flink cluster cannot be restarted, which is a vicious circle. Thanks!

Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/flink-1.6.0-hdp/lib/phoenix-4.7.0.2.6.3.0-235-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/flink-1.6.0-hdp/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2019-09-05 10:45:41,807 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hive.
2019-09-05 10:45:41,807 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hive.
2019-09-05 10:45:42,056 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
2019-09-05 10:45:42,056 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
YARN properties set default parallelism to 1
2019-09-05 10:45:42,269 INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at ac13ghdpt2m01.lab-rot.saas.sap.corp/10.116.201.103:10200
2019-09-05 10:45:42,276 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-09-05 10:45:42,276 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-09-05 10:45:42,282 WARN  org.apache.flink.yarn.AbstractYarnClusterDescriptor - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set. The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
2019-09-05 10:45:42,284 INFO  org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider - Looking for the active RM in [rm1, rm2]...
2019-09-05 10:45:42,341 INFO  org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider - Found active RM [rm1]
2019-09-05 10:45:42,345 INFO  org.apache.flink.yarn.AbstractYarnClusterDescriptor - Found application JobManager host name 'ac13ghdpt2dn01.lab-rot.saas.sap.corp' and port '40192' from supplied application id 'application_1559153472177_52202'
2019-09-05 10:45:42,689 WARN  org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Triggering savepoint for job 6399ec2e8fdf4cb7d8481890019554f6.
Waiting for response...
------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.util.FlinkException: Triggering a savepoint for the job 6399ec2e8fdf4cb7d8481890019554f6 failed.
    at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:714)
    at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:692)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:979)
    at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:689)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1059)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
    ... 10 more
Caused by: java.util.concurrent.TimeoutException
    ... 8 more
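For reference, "akka.client.timeout" is a client-side setting: the CLI reads flink-conf.yaml on every invocation, so raising it does not require restarting the cluster, only editing the file on the machine where the CLI runs. A minimal fragment; the 600 s value is an illustrative choice, not a recommendation (the default in this Flink version is 60 s):

```yaml
# flink-conf.yaml on the host where "bin/flink savepoint" is invoked.
# akka.client.timeout bounds how long the CLI waits for the savepoint
# acknowledgement before failing with a TimeoutException, as in the
# stack trace above.
akka.client.timeout: 600 s
```

Note that this only makes the client wait longer; if the savepoint itself never completes (for example because of back-pressured operators or slow HDFS writes), the cause still has to be found on the JobManager/TaskManager side.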
Have all of the checkpoints been failing? If there is no valid checkpoint, the savepoint will fail.
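One way to verify the checkpoint status, and to sidestep the CLI-side Akka timeout entirely, is the JobManager's REST API: a GET on /jobs/:jobid/checkpoints reports checkpoint history, and a POST to /jobs/:jobid/savepoints triggers a savepoint asynchronously. A minimal sketch; the helper names are mine, and the host:port and job id are placeholders taken from the CLI log in this thread:

```python
# Sketch: build the Flink REST endpoints for checkpoint inspection and
# savepoint triggering. Substitute the JobManager host:port that the CLI
# printed when it resolved the YARN application.

def checkpoints_url(jobmanager: str, job_id: str) -> str:
    # GET this URL: the JSON response lists completed and failed
    # checkpoints, which answers "is there a valid checkpoint at all?".
    return "http://{}/jobs/{}/checkpoints".format(jobmanager, job_id)

def savepoint_trigger_url(jobmanager: str, job_id: str) -> str:
    # POST a JSON body such as {"cancel-job": false} to this URL; the
    # response contains a trigger id that can be polled for completion,
    # so no client-side Akka timeout is involved.
    return "http://{}/jobs/{}/savepoints".format(jobmanager, job_id)

jm = "ac13ghdpt2dn01.lab-rot.saas.sap.corp:40192"  # from the CLI log
job = "6399ec2e8fdf4cb7d8481890019554f6"
print(checkpoints_url(jm, job))
print(savepoint_trigger_url(jm, job))
```

The URLs can then be queried with any HTTP client (curl, for instance) from a machine that can reach the JobManager.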
--
高飞龙
Mobile: +86 18710107193
[hidden email]

From: Jimmy.Shao
Sent: 2019-09-06 14:38
To: user-zh
Subject: Flink Savepoint timeout

> Has anyone run into a timeout exception when manually triggering a Flink savepoint from the CLI?
> Cancelling the job with a savepoint fails with the same timeout error. The savepoint is already configured to be written to HDFS, and Flink itself runs on YARN. The docs mention "akka.client.timeout", but it only takes effect in flink-conf.yaml and cannot be passed through the CLI. So the job cannot be cancelled and the cluster cannot be restarted. Thanks!
> [quoted CLI log and stack trace trimmed; identical to the original message above]
The checkpoints have all been succeeding.
I retried "cancel job with savepoint" today and it succeeded. I don't know why the earlier attempts all timed out.

On Fri, Sep 6, 2019 at 4:33 PM [hidden email] <[hidden email]> wrote:

> Have all of the checkpoints been failing? If there is no valid checkpoint, the savepoint will fail.
>
> [quoted message trimmed; identical to the reply and original post above]
SJMSTER wrote:
> The checkpoints have all been succeeding.
> I retried "cancel job with savepoint" today and it succeeded.
> I don't know why the earlier attempts all timed out.

Are there any log items for diagnosis?

Regards.
I searched around and did not find any other errors; the exception I pasted above is the only one.
It was reported by the CLI at execution time...

On Fri, Sep 6, 2019 at 4:51 PM Wesley Peng <[hidden email]> wrote:

> SJMSTER wrote:
> > The checkpoints have all been succeeding.
> > I retried "cancel job with savepoint" today and it succeeded.
> > I don't know why the earlier attempts all timed out.
>
> Are there any log items for diagnosis?
>
> Regards.