flink1.10 stop with a savepoint fails

flink1.10 stop with a savepoint fails

LiangbinZhang
A plain source -> map -> filter -> sink test application.

Script used to trigger the savepoint:
    ${FLINK_HOME} stop -p ${TARGET_DIR} -d ${JOB_ID}
Full error message:

org.apache.flink.util.FlinkException: Could not stop with a savepoint job "81990282a4686ebda3d04041e3620776".
        at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:462)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
        at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:454)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:907)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:460)
        ... 9 more


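Looking at the error, I suspect a permissions problem. I started the application as the root user, and the HDFS path that holds the savepoint directory is also owned by root. If I trigger a savepoint directly without stopping the application, it works fine. Digging further, it looks like root runs into a permissions problem when stopping the Hadoop application, but I don't know how to solve it, and I'm stuck here for now.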



--
Sent from: http://apache-flink.147419.n8.nabble.com/
Re: flink1.10 stop with a savepoint fails

Congxian Qiu
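Hi
    You could check the JM (JobManager) log to see what caused this savepoint to fail. If the savepoint timed out, you need to find out which task is slow to complete (a savepoint can be slower than a checkpoint).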
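    As a side note on reading the trace: the "Caused by" section shows a java.util.concurrent.TimeoutException raised from CompletableFuture.get inside CliFrontend, which suggests the client gave up waiting rather than the savepoint itself reporting an error. Below is a minimal sketch of that pattern using only the JDK. The class name StopTimeoutSketch, the method waitFor, and the savepoint path are hypothetical names made up for illustration; this is not Flink's actual client code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not taken from Flink's CliFrontend.
public class StopTimeoutSketch {

    // Waits on a future with a bounded get(), the JDK call seen in the trace.
    static String waitFor(CompletableFuture<String> savepointPath, long timeoutMillis)
            throws Exception {
        try {
            return "completed: " + savepointPath.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Only the wait expired; the future is neither failed nor cancelled.
            return "client timed out, future done=" + savepointPath.isDone();
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that nobody has completed yet stands in for a slow savepoint.
        CompletableFuture<String> slow = new CompletableFuture<>();
        System.out.println(waitFor(slow, 100));

        // Once the result arrives, the same call returns it immediately.
        slow.complete("hdfs:///hypothetical/savepoint-path"); // assumed path
        System.out.println(waitFor(slow, 100));
    }
}
```

    In this sketch the first call reports a client-side timeout while the future is still pending, so a timeout on the CLI alone does not say whether the operation behind it failed. That is why the JM log is the place to look.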
Best,
Congxian


This replies to the post Robin Zhang <[hidden email]> sent on Monday, October 19, 2020 at 3:42 PM.


Re: flink1.10 stop with a savepoint fails

zilong xiao
In reply to this post by LiangbinZhang
Hi Robin Zhang
You have probably run into the problem reported in this issue: https://issues.apache.org/jira/browse/FLINK-16626
Take a look at the issue description. Best wishes~


Re: flink1.10 stop with a savepoint fails

LiangbinZhang
In reply to this post by Congxian Qiu
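Hi Congxian,
    Thanks for the suggestion. I took a look, but the JM side does not log anything about the failure; I can only see the log entries for normal checkpoints.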

Best,
Robin




Re: flink1.10 stop with a savepoint fails

LiangbinZhang
In reply to this post by zilong xiao
Hi zilong,
    That was indeed the problem. Thanks for the help.
Best,
Robin

