Has anyone run into a timeout exception when manually triggering a Flink savepoint from the CLI?
Cancelling the job with a savepoint ("cancel with savepoint") fails with the same timeout error. The savepoint is already configured to be written to HDFS, and Flink itself runs on YARN. The official docs mention a parameter "akka.client.timeout" that may apply here, but it only takes effect when set in flink-conf.yaml and there is no way to pass it through the CLI. So the job cannot be cancelled and the Flink cluster cannot be restarted, which is a vicious circle. Thanks!

Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/flink-1.6.0-hdp/lib/phoenix-4.7.0.2.6.3.0-235-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/flink-1.6.0-hdp/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2019-09-05 10:45:41,807 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hive.
2019-09-05 10:45:41,807 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hive.
2019-09-05 10:45:42,056 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
2019-09-05 10:45:42,056 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
YARN properties set default parallelism to 1
2019-09-05 10:45:42,269 INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at ac13ghdpt2m01.lab-rot.saas.sap.corp/10.116.201.103:10200
2019-09-05 10:45:42,276 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-09-05 10:45:42,276 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2019-09-05 10:45:42,282 WARN  org.apache.flink.yarn.AbstractYarnClusterDescriptor - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set. The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
2019-09-05 10:45:42,284 INFO  org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider - Looking for the active RM in [rm1, rm2]...
2019-09-05 10:45:42,341 INFO  org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider - Found active RM [rm1]
2019-09-05 10:45:42,345 INFO  org.apache.flink.yarn.AbstractYarnClusterDescriptor - Found application JobManager host name 'ac13ghdpt2dn01.lab-rot.saas.sap.corp' and port '40192' from supplied application id 'application_1559153472177_52202'
2019-09-05 10:45:42,689 WARN  org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Triggering savepoint for job 6399ec2e8fdf4cb7d8481890019554f6.
Waiting for response...
------------------------------------------------------------
The program finished with the following exception:

org.apache.flink.util.FlinkException: Triggering a savepoint for the job 6399ec2e8fdf4cb7d8481890019554f6 failed.
    at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:714)
    at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:692)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:979)
    at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:689)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1059)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
    ... 10 more
Caused by: java.util.concurrent.TimeoutException
    ... 8 more
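For reference, "akka.client.timeout" is a client-side setting: the CLI reads flink-conf.yaml on every invocation, so raising it does not require restarting the cluster, only editing the file on the machine where the CLI runs. A minimal fragment; the 600 s value is an illustrative choice, not a recommendation (the default in this Flink version is 60 s):

```yaml
# flink-conf.yaml on the host where "bin/flink savepoint" is invoked.
# akka.client.timeout bounds how long the CLI waits for the savepoint
# acknowledgement before failing with a TimeoutException, as in the
# stack trace above.
akka.client.timeout: 600 s
```

Note that this only makes the client wait longer; if the savepoint itself never completes (for example because of back-pressured operators or slow HDFS writes), the cause still has to be found on the JobManager/TaskManager side.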
Have all of the checkpoints been failing? If there is no valid checkpoint, the savepoint will fail.
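One way to verify the checkpoint status, and to sidestep the CLI-side Akka timeout entirely, is the JobManager's REST API: a GET on /jobs/:jobid/checkpoints reports checkpoint history, and a POST to /jobs/:jobid/savepoints triggers a savepoint asynchronously. A minimal sketch; the helper names are mine, and the host:port and job id are placeholders taken from the CLI log in this thread:

```python
# Sketch: build the Flink REST endpoints for checkpoint inspection and
# savepoint triggering. Substitute the JobManager host:port that the CLI
# printed when it resolved the YARN application.

def checkpoints_url(jobmanager: str, job_id: str) -> str:
    # GET this URL: the JSON response lists completed and failed
    # checkpoints, which answers "is there a valid checkpoint at all?".
    return "http://{}/jobs/{}/checkpoints".format(jobmanager, job_id)

def savepoint_trigger_url(jobmanager: str, job_id: str) -> str:
    # POST a JSON body such as {"cancel-job": false} to this URL; the
    # response contains a trigger id that can be polled for completion,
    # so no client-side Akka timeout is involved.
    return "http://{}/jobs/{}/savepoints".format(jobmanager, job_id)

jm = "ac13ghdpt2dn01.lab-rot.saas.sap.corp:40192"  # from the CLI log
job = "6399ec2e8fdf4cb7d8481890019554f6"
print(checkpoints_url(jm, job))
print(savepoint_trigger_url(jm, job))
```

The URLs can then be queried with any HTTP client (curl, for instance) from a machine that can reach the JobManager.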
--
高飞龙
Mobile: +86 18710107193
[hidden email]

From: Jimmy.Shao
Sent: 2019-09-06 14:38
To: user-zh
Subject: Flink Savepoint timeout

> Has anyone run into a timeout exception when manually triggering a Flink savepoint from the CLI?
> Cancelling the job with a savepoint fails with the same timeout error. The savepoint is already configured to be written to HDFS, and Flink itself runs on YARN. The docs mention "akka.client.timeout", but it only takes effect in flink-conf.yaml and cannot be passed through the CLI. So the job cannot be cancelled and the cluster cannot be restarted. Thanks!
> [quoted CLI log and stack trace trimmed; identical to the original message above]
The checkpoints have all been succeeding.
I retried "cancel job with savepoint" today and it succeeded. I don't know why the earlier attempts all timed out.

On Fri, Sep 6, 2019 at 4:33 PM [hidden email] <[hidden email]> wrote:

> Have all of the checkpoints been failing? If there is no valid checkpoint, the savepoint will fail.
>
> [quoted message trimmed; identical to the reply and original post above]
SJMSTER wrote:
> The checkpoints have all been succeeding.
> I retried "cancel job with savepoint" today and it succeeded.
> I don't know why the earlier attempts all timed out.

Are there any log items for diagnosis?

Regards.
I searched around and did not find any other errors; the exception I pasted above is the only one.
It was reported by the CLI at execution time...

On Fri, Sep 6, 2019 at 4:51 PM Wesley Peng <[hidden email]> wrote:

> SJMSTER wrote:
> > The checkpoints have all been succeeding.
> > I retried "cancel job with savepoint" today and it succeeded.
> > I don't know why the earlier attempts all timed out.
>
> Are there any log items for diagnosis?
>
> Regards.