Flink 1.10.1 checkpoint failure issue

Storm☀️
Hi all, I have a checkpoint-related question.

Flink version: 1.10.1. A few checkpoints fail with the following error:
java.lang.Exception: Could not perform checkpoint 1194 for operator Map (3/3).
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:816)
        at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:86)
        at org.apache.flink.streaming.runtime.io.CheckpointBarrierTracker.processBarrier(CheckpointBarrierTracker.java:99)
        at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:155)
        at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:310)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
        at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1382)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:803)
        ... 12 more

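The vast majority of checkpoints complete normally, but a small fraction fail as shown above, and each failure also sends the job through suspending-->restart.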



--
Sent from: http://apache-flink.147419.n8.nabble.com/

Re: Flink 1.10.1 checkpoint failure issue

Congxian Qiu
Hi
    This looks like the same problem as FLINK-17479, which only shows up on specific JDK versions. You could consider upgrading Flink or switching to a different JDK version.

Best,
Congxian


Storm☀️ <[hidden email]> wrote on Sunday, September 27, 2020 at 10:17 AM:


Re: Flink 1.10.1 checkpoint failure issue

Storm☀️
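Thanks.
I read that issue. The affected version there is JDK 1.8 update 60, whereas we are running update 74.
I will try upgrading the JDK to update 251 in my test environment.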




Re: Flink 1.10.1 checkpoint failure issue

Storm☀️
In reply to this post by Congxian Qiu
I upgraded the JDK to update 261, but the error still occurs.
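
When a JDK upgrade seems to make no difference, one thing worth ruling out is that the job's JVMs are still running on the old JDK (a later reply in this thread notes that on YARN the NodeManager's JDK would have to be updated). The following is a minimal sketch, not something taken from the thread: `JvmVersionCheck` and `updateNumber` are hypothetical names, and it only uses the standard `java.version` and `java.home` system properties. Running the same lookup from inside a task, for example in an operator's `open()` method, would show the JDK that the TaskManager actually uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Flink): reports which JDK this JVM runs on.
final class JvmVersionCheck {

    // Extracts the update number from a legacy "1.8.0_261"-style version string.
    // Returns -1 when the string does not follow that scheme (for example "11.0.8").
    static int updateNumber(String javaVersion) {
        Matcher m = Pattern.compile("^1\\.\\d+\\.\\d+_(\\d+)").matcher(javaVersion);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version);
        System.out.println("java.home    = " + System.getProperty("java.home"));
        System.out.println("update       = " + updateNumber(version));
    }
}
```

If the reported update number is still the old one, the upgrade has not reached the JVMs that run the job, and the test says nothing yet about whether the JDK version matters.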




Re: Flink 1.10.1 checkpoint failure issue

大森林
I am running an older JDK 8 build here; it has nothing to do with JDK update 261.




------------------ Original message ------------------
From: "user-zh" <[hidden email]>
Sent: Saturday, October 10, 2020, 9:03 AM
To: "user-zh" <[hidden email]>
Subject: Re: Flink 1.10.1 checkpoint failure issue

> I upgraded the JDK to update 261, but the error still occurs.




Re: Flink 1.10.1 checkpoint failure issue

Congxian Qiu
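Hi @Storm, which Flink version are you using, and what does the stack trace look like? Please post the relevant details here so we can look into the problem together.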

Best,
Congxian


大森林 <[hidden email]> wrote on Saturday, October 10, 2020 at 1:05 PM:


Re: Flink 1.10.1 checkpoint failure issue

Storm☀️
Flink version: Flink 1.10.1
Deployment: Flink on YARN
Hadoop version: cdh5.15.2-2.6.0
Current status: Checkpoint Counts: Triggered: 9339, In Progress: 0, Completed: 8439, Failed: 900, Restored: 7
Error message:
java.lang.Exception: Could not perform checkpoint 1194 for operator Map (3/3).
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:816)
        at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:86)
        at org.apache.flink.streaming.runtime.io.CheckpointBarrierTracker.processBarrier(CheckpointBarrierTracker.java:99)
        at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:155)
        at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:310)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
        at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1382)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:803)
        ... 12 more


The same program on the 11.2 release (presumably Flink 1.11.2) checkpoints without any problems.
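
To put the counters in this message into perspective, they can be turned into a failure rate. The snippet below is only an illustrative calculation over the numbers quoted in the message (9339 triggered, 8439 completed, 900 failed); `CheckpointStats` and `failureRatePercent` are hypothetical names, not Flink APIs.

```java
// Hypothetical helper: derives a failure rate from the checkpoint counters quoted above.
final class CheckpointStats {

    // Percentage of triggered checkpoints that failed.
    static double failureRatePercent(long triggered, long failed) {
        return 100.0 * failed / triggered;
    }

    public static void main(String[] args) {
        long triggered = 9339, inProgress = 0, completed = 8439, failed = 900;
        // Sanity check: every triggered checkpoint is in progress, completed, or failed.
        System.out.println("counters consistent: " + (inProgress + completed + failed == triggered));
        System.out.printf("failure rate: %.2f%%%n", failureRatePercent(triggered, failed));
    }
}
```

With these numbers, roughly one checkpoint in ten fails (about 9.6%), so the failures are a minority but far from rare.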






Re:Re: Flink 1.10.1 checkpoint failure issue

hailongwang


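We have also hit this problem in our 1.10 production environment, and several issues discuss it, for example:
https://issues.apache.org/jira/browse/FLINK-18196
Two approaches are mentioned there, and we have looked into both:
1. Switch to a different JDK version. We did not try this, because it requires updating the JDK on the NodeManager, which is costly.
2. Create a new CheckpointMetaData instance. After making this change the problem has not recurred in production, but the root cause is still unclear to us.
Hope this helps.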
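
The second approach can be pictured roughly as follows. This is only a sketch under stated assumptions, not the code from the FLINK-18196 discussion and not the actual Flink patch: `MetaData` is a hypothetical stand-in for Flink's `CheckpointMetaData` (a checkpoint id plus a timestamp), and `checkpointAction` is an invented helper. The idea it illustrates is building a fresh instance from the original's fields and letting the lambda capture that copy instead of the caller's reference.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for Flink's CheckpointMetaData; not the real class.
final class MetaData {
    private final long checkpointId;
    private final long timestamp;

    MetaData(long checkpointId, long timestamp) {
        this.checkpointId = checkpointId;
        this.timestamp = timestamp;
    }

    long getCheckpointId() { return checkpointId; }

    long getTimestamp() { return timestamp; }
}

final class CheckpointMetaDataCopyDemo {

    // Builds the deferred action from a freshly created copy of the metadata,
    // so the lambda never captures the reference passed in by the caller.
    static LongSupplier checkpointAction(MetaData original) {
        final MetaData copy =
                new MetaData(original.getCheckpointId(), original.getTimestamp());
        return () -> copy.getCheckpointId();
    }

    public static void main(String[] args) {
        LongSupplier action =
                checkpointAction(new MetaData(1194L, System.currentTimeMillis()));
        System.out.println("checkpoint id seen by the action: " + action.getAsLong());
    }
}
```

The sketch shows only the shape of the change. As the reply says, why this makes the NullPointerException go away is not understood, so it should be read as a workaround rather than a fix.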


Best,
Hailong Wang




At 2020-10-13 18:04:11, "Storm☀️" <[hidden email]> wrote:


Re: Re:Re: Flink 1.10.1 checkpoint failure issue

Storm☀️
Thank you very much.
I will keep following this problem and report back here once I have a conclusion, for everyone's reference.




Re: Re:Re: Flink 1.10.1 checkpoint failure issue

Congxian Qiu
FYI, here is an article that may be related [1].

[1] https://cloud.tencent.com/developer/news/564780

Best,
Congxian


Storm☀️ <[hidden email]> wrote on Thursday, October 15, 2020 at 10:42 AM:
