flink savepoint

flink savepoint

张锴
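While using flink savepoint to save a snapshot, I ran into an error, and I am not yet sure what is causing it. Could anyone passing by take a look?

Flink version: 1.10.1

Command run:

flink savepoint a3a2e6c3a5a00bbe4c0c9e351dc58c47 hdfs://hadoopnamenodeHA/flink/flink-savepoints

Error message: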


org.apache.flink.util.FlinkException: Triggering a savepoint for the job a3a2e6c3a5a00bbe4c0c9e351dc58c47 failed.
    at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:631)
    at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:609)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:841)
    at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:606)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:908)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:966)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:966)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:625)

Re:flink savepoint

hailongwang
Hi,


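This error only means that the savepoint did not complete within the allotted time, so the client's connection to the Master timed out.
To find the actual cause, you need to check the JobMaster log.
PS: When a job keeps restarting or is under backpressure, savepoints generally fail.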


Best,
Hailong Wang
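
To illustrate why the trace above says so little: its last frames are `CompletableFuture.get` and `timedGet`, and a timed wait on a future reports only that the wait ran out, not what went wrong on the other side. Below is a minimal Java sketch of that behavior. It is not Flink's client code; the class name `TimedGetSketch`, the method `awaitResult`, and the failure message are made up for illustration.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedGetSketch {

    /** Waits up to timeoutMillis for the future and describes how the wait ended. */
    public static String awaitResult(CompletableFuture<String> future, long timeoutMillis) {
        try {
            return "completed: " + future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Raised by the waiting side only; no remote cause is attached to it.
            return "timed out";
        } catch (ExecutionException e) {
            // Raised when the future itself completed exceptionally; the cause is attached.
            return "failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for an operation that never reports back in time.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(awaitResult(stuck, 100));

        // Hypothetical stand-in for a failure that does reach the caller.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IOException("hypothetical failure"));
        System.out.println(awaitResult(failed, 100));
    }
}
```

In the first case the caller learns nothing beyond the timeout, which is why the real cause has to be looked up in the logs on the other side.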




On 2020-11-06 09:33:48, "张锴" <[hidden email]> wrote:

> While using flink savepoint to save a snapshot, I ran into an error, and I am not yet sure what is causing it. Could anyone passing by take a look?
>
> Flink version: 1.10.1
>
> Command run: flink savepoint a3a2e6c3a5a00bbe4c0c9e351dc58c47 hdfs://hadoopnamenodeHA/flink/flink-savepoints

Re: flink savepoint

张锴
Restarts and backpressure both look normal.
I also increased the client-to-master timeout, but the problem persists.

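hailongwang <[hidden email]> wrote on Fri, Nov 6, 2020 at 10:54:

> Hi,
>
> This error only means that the savepoint did not complete within the allotted time, so the client's connection to the Master timed out.
> To find the actual cause, you need to check the JobMaster log.
> PS: When a job keeps restarting or is under backpressure, savepoints generally fail.
>
> Best,
> Hailong Wang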

Re: flink savepoint

Congxian Qiu
Hi
     Can you see any other exceptions in the client-side log or the JM log?
Best,
Congxian


张锴 <[hidden email]> wrote on Fri, Nov 6, 2020 at 11:42 AM:

> Restarts and backpressure both look normal.
> I also increased the client-to-master timeout, but the problem persists.

Re: flink savepoint

admin
Hi,
Is your job running on YARN? If so, you need to specify -yid.

> On Nov 6, 2020, at 1:31 PM, Congxian Qiu <[hidden email]> wrote:
>
> Hi
>     Can you see any other exceptions in the client-side log or the JM log?
> Best,
> Congxian

Re: flink savepoint

张锴
I have already specified it.

admin <[hidden email]> wrote on Fri, Nov 6, 2020 at 3:17 PM:

> Hi,
> Is your job running on YARN? If so, you need to specify -yid.

Re: flink savepoint

张锴
In reply to this post by Congxian Qiu
Found it. From the JM I could see that write permission was missing; after fixing that, everything worked.

Congxian Qiu <[hidden email]> wrote on Fri, Nov 6, 2020 at 1:31 PM:

> Hi
>      Can you see any other exceptions in the client-side log or the JM log?
> Best,
> Congxian