Flink job: automatic checkpoints succeed, but manual checkpoint fails

Zhou Zach




2020-06-19 15:11:18,361 INFO  org.apache.flink.client.cli.CliFrontend                       - Triggering savepoint for job e229c76e6a1b43142cb4272523102ed1.
2020-06-19 15:11:18,378 INFO  org.apache.flink.client.cli.CliFrontend                       - Waiting for response...
2020-06-19 15:11:48,381 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2020-06-19 15:11:48,382 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2020-06-19 15:11:48,385 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x172b776fac82479 closed
2020-06-19 15:11:48,385 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x172b776fac82479
2020-06-19 15:11:48,385 ERROR org.apache.flink.client.cli.CliFrontend                       - Error while running the command.
org.apache.flink.util.FlinkException: Triggering a savepoint for the job e229c76e6a1b43142cb4272523102ed1 failed.
        at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:633)
        at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:611)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
        at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:608)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:910)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
Caused by: java.util.concurrent.TimeoutException
        at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:999)
        at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$14(FutureUtils.java:427)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Re: Flink job: automatic checkpoints succeed, but manual checkpoint fails

Congxian Qiu
Hi

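By "manual checkpoint" I assume you mean a savepoint. The stack trace shows the command failed with a timeout, which may be because the savepoint is slow.
You can check the JM (JobManager) log to see whether the savepoint took a long time to complete.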
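
For the log check suggested above, here is a minimal sketch. It assumes the JM log uses the same timestamp prefix as the client log quoted in this thread; the helper names and the file name jobmanager.log are hypothetical. It picks out timestamped lines that mention "savepoint" and measures the time between two log lines. Applied to the client log, it shows the CLI gave up about 30 seconds after "Waiting for response...".

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the client log, e.g. "2020-06-19 15:11:18,361".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if the line has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def savepoint_lines(lines):
    """Yield (timestamp, line) for timestamped lines that mention 'savepoint'."""
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and "savepoint" in line.lower():
            yield ts, line.rstrip("\n")


def elapsed_seconds(first_line, last_line):
    """Seconds between the timestamps of two log lines."""
    return (parse_ts(last_line) - parse_ts(first_line)).total_seconds()


# Two lines copied from the client log in this thread.
waiting = ("2020-06-19 15:11:18,378 INFO  org.apache.flink.client.cli.CliFrontend"
           "                       - Waiting for response...")
error = ("2020-06-19 15:11:48,385 ERROR org.apache.flink.client.cli.CliFrontend"
         "                       - Error while running the command.")

client_wait = elapsed_seconds(waiting, error)
print(f"client waited {client_wait:.3f} s before failing")

# To scan a JM log (hypothetical path), something like:
#   with open("jobmanager.log") as f:
#       for ts, line in savepoint_lines(f):
#           print(ts, line)
```

If the JM log shows the savepoint finishing well after the client's error, the savepoint was slow rather than broken. If the JM log has no savepoint lines at all, the request may never have reached the JM.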

Also, could you describe your main use case for savepoints?
1. Why do you need to use savepoints?
2. In your scenario, could checkpoints be used instead of savepoints?

Best,
Congxian


On Friday, June 19, 2020 at 3:25 PM, Zhou Zach <[hidden email]> wrote:

>
>
>
>
> 2020-06-19 15:11:18,361 INFO  org.apache.flink.client.cli.CliFrontend
>                  - Triggering savepoint for job
> e229c76e6a1b43142cb4272523102ed1.
> 2020-06-19 15:11:18,378 INFO  org.apache.flink.client.cli.CliFrontend
>                  - Waiting for response...
> 2020-06-19 15:11:48,381 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
> 2020-06-19 15:11:48,382 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
> - backgroundOperationsLoop exiting
> 2020-06-19 15:11:48,385 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
> Session: 0x172b776fac82479 closed
> 2020-06-19 15:11:48,385 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
> EventThread shut down for session: 0x172b776fac82479
> 2020-06-19 15:11:48,385 ERROR org.apache.flink.client.cli.CliFrontend
>                  - Error while running the command.
> org.apache.flink.util.FlinkException: Triggering a savepoint for the job
> e229c76e6a1b43142cb4272523102ed1 failed.
>         at
> org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:633)
>         at
> org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:611)
>         at
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>         at
> org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:608)
>         at
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:910)
>         at
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
>         at
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>         at
> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.util.concurrent.TimeoutException
>         at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:999)
>         at
> org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
>         at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$14(FutureUtils.java:427)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)