Flink checkpoint fails: Checkpoint Coordinator is suspending.


chen310
My Flink job's checkpoints keep failing. Could someone help me understand the cause?

<http://apache-flink.147419.n8.nabble.com/file/t572/config.config>

<http://apache-flink.147419.n8.nabble.com/file/t572/history.history>

<http://apache-flink.147419.n8.nabble.com/file/t572/history_detail.history_detail>


JobManager log:
2021-02-01 08:54:43,639 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:44,642 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:45,644 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:46,647 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:47,649 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:48,652 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:49,655 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:50,658 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:50,921 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 8697 (type=CHECKPOINT) @ 1612169690917 for job 1299f2f27e56ec36a4e0ffd3472ad399.
2021-02-01 08:54:50,999 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Decline checkpoint 8697 by task 320d2c162f17265435777bb65e1a8934 of job 1299f2f27e56ec36a4e0ffd3472ad399 at container_e21_1596002540781_1159_01_000134 @ ip-10-120-83-22.ap-northeast-1.compute.internal (dataPort=42984).
2021-02-01 08:54:51,661 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler  [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:52,654 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1) (6beee54a923323c369b046e199f572c4) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@379a8f9c.
java.io.IOException: Could not perform checkpoint 8697 for operator GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1).
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:897) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:113) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner.processBarrier(CheckpointBarrierAligner.java:137) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:93) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:158) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:567) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 8697 for operator GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1). Failure reason: Checkpoint was declined.
        at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:215) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:156) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:314) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:614) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:540) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:507) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:266) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:926) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:916) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:884) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        ... 13 more
Caused by: org.apache.flink.util.SerializedThrowable: While open a file for appending: /server/yarn/nm/usercache/yarn/appcache/application_1596002540781_1159/flink-io-1ad6bdc6-aea8-4dc5-a133-7c7b5e2361fe/job_1299f2f27e56ec36a4e0ffd3472ad399_op_AggregateWindowOperator_fa157648fdadffa65122f5b4200f4fda__1_1__uuid_9744ef17-bf12-471c-b486-19140201517f/db/038968.sst: Too many open files
        at org.rocksdb.Checkpoint.createCheckpoint(Native Method) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.rocksdb.Checkpoint.createCheckpoint(Checkpoint.java:51) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.takeDBNativeCheckpoint(RocksIncrementalSnapshotStrategy.java:255) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.doSnapshot(RocksIncrementalSnapshotStrategy.java:159) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.contrib.streaming.state.snapshot.RocksDBSnapshotStrategyBase.snapshot(RocksDBSnapshotStrategyBase.java:126) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.snapshot(RocksDBKeyedStateBackend.java:459) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:198) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:156) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:314) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:614) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:540) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:507) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:266) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:926) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:916) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:884) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
        ... 13 more
2021-02-01 08:54:52,654 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task fa157648fdadffa65122f5b4200f4fda_0.
2021-02-01 08:54:52,654 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 7 tasks should be restarted to recover the failed task fa157648fdadffa65122f5b4200f4fda_0.
2021-02-01 08:54:52,654 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job insert-into_default_catalog.default_database.risk_final_accept_sink,default_catalog.default_database.risk_final_accept_grafana_sink (1299f2f27e56ec36a4e0ffd3472ad399) switched from state RUNNING to RESTARTING.
2021-02-01 08:54:52,654 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT merchantReferenceCode) AS acceptCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[_UTF-16LE'risk_final_accept_hop10min30min' AS eventCode, (w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss') AS timeStart, (w$end DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss') AS timeEnd, (UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, _UTF-16LE'0' AS userId, acceptCount]) (1/1) (52f55328f6bf756dd1c63bb0d149e55b) switched from RUNNING to CANCELING.
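The innermost exception in the trace above is RocksDB's `Too many open files`. One possible first check is to compare how many file descriptors a process currently holds with its open-file limit. The following is a minimal Python sketch, assuming a Linux host with `/proc` available; `fd_usage` is a hypothetical helper name, and whether the OS limit is actually the cause here has not been established in this thread.

```python
import os
from typing import Optional, Tuple


def fd_usage(pid: str = "self") -> Tuple[int, Optional[int]]:
    """Return (open_descriptor_count, soft_limit) for a Linux process.

    pid is a process id given as a string, or "self" for the current
    process. soft_limit is None when the limit is reported as "unlimited".
    """
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_count = len(os.listdir(f"/proc/{pid}/fd"))

    soft_limit: Optional[int] = None
    with open(f"/proc/{pid}/limits", encoding="ascii") as limits:
        for line in limits:
            # Line format: "Max open files  <soft>  <hard>  files"
            if line.startswith("Max open files"):
                soft_field = line.split()[3]
                if soft_field != "unlimited":
                    soft_limit = int(soft_field)
                break
    return open_count, soft_limit


# Inspect the current process; pass a TaskManager pid (as a string) instead.
count, limit = fd_usage()
print(f"open descriptors: {count}, soft limit: {limit}")
```

To inspect a TaskManager rather than the script itself, pass its process id, for example `fd_usage("12345")` (the pid here is made up), running as a user that is allowed to read that process's `/proc` entries. A count close to the soft limit would be consistent with the error in the log.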




--
Sent from: http://apache-flink.147419.n8.nabble.com/

Re: Flink checkpoint fails: Checkpoint Coordinator is suspending.

Congxian Qiu
Hi,
Which Flink version are you using, and what is your job's checkpoint/state configuration? If possible, please also send the complete JM log.
Best,
Congxian


On Mon, Feb 1, 2021 at 5:41 PM, chen310 <[hidden email]> wrote:

> Some additional information: the exception in the JobManager log:
> [...]

Re: Flink checkpoint fails: Checkpoint Coordinator is suspending.

chen310
The Flink version is 1.12-SNAPSHOT. We built this package shortly after 1.11 was released in order to use a new 1.12 feature, so the code should be fairly close to 1.11. The checkpoint configuration is:

pipeline.time-characteristic EventTime
execution.checkpointing.interval 600000
execution.checkpointing.min-pause 120000
execution.checkpointing.timeout 120000
execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
state.backend rocksdb
state.backend.incremental true
state.checkpoints.dir hdfs:///tmp/flink/checkpoint

The complete JM log is very large (over 1 GB); what I posted above is the key error information.
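When the full log is too large to post, one option is to stream it and keep only the chained-exception lines, which is where the root cause (`Too many open files` in the excerpt above) shows up. The following is a minimal Python sketch, assuming the log is a plain-text file; the function name `root_causes` and the file name in the comment are hypothetical and not from this thread.

```python
import re
from typing import Iterable, List

# Matches the start of a chained-exception line in a Java stack trace.
CAUSED_BY = re.compile(r"^\s*Caused by: (.*)$")


def root_causes(lines: Iterable[str]) -> List[str]:
    """Return the distinct 'Caused by:' messages in first-seen order.

    Lines are consumed one at a time, so a multi-gigabyte log never
    has to fit in memory.
    """
    seen = set()
    causes = []
    for line in lines:
        match = CAUSED_BY.match(line)
        if match is None:
            continue
        message = match.group(1).rstrip()
        if message not in seen:
            seen.add(message)
            causes.append(message)
    return causes


# Hypothetical usage against a log file on disk:
#
#     with open("jobmanager.log", encoding="utf-8", errors="replace") as log:
#         for cause in root_causes(log):
#             print(cause)
```

Deduplicating the messages keeps the output short even when the same failure repeats on every restart.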


