Flink checkpoints keep failing. Could someone help me figure out the cause?
<http://apache-flink.147419.n8.nabble.com/file/t572/config.config>
<http://apache-flink.147419.n8.nabble.com/file/t572/history.history>
<http://apache-flink.147419.n8.nabble.com/file/t572/history_detail.history_detail>

JobManager log:

2021-02-01 08:54:43,639 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:44,642 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:45,644 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:46,647 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:47,649 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:48,652 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:49,655 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:50,658 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:50,921 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 8697 (type=CHECKPOINT) @ 1612169690917 for job 1299f2f27e56ec36a4e0ffd3472ad399.
2021-02-01 08:54:50,999 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Decline checkpoint 8697 by task 320d2c162f17265435777bb65e1a8934 of job 1299f2f27e56ec36a4e0ffd3472ad399 at container_e21_1596002540781_1159_01_000134 @ ip-10-120-83-22.ap-northeast-1.compute.internal (dataPort=42984).
2021-02-01 08:54:51,661 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
2021-02-01 08:54:52,654 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1) (6beee54a923323c369b046e199f572c4) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@379a8f9c.
java.io.IOException: Could not perform checkpoint 8697 for operator GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1).
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:897) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:113) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner.processBarrier(CheckpointBarrierAligner.java:137) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:93) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:158) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:567) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 8697 for operator GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1). Failure reason: Checkpoint was declined.
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:215) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:156) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:314) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:614) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:540) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:507) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:266) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:926) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:916) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:884) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    ... 13 more
Caused by: org.apache.flink.util.SerializedThrowable: While open a file for appending: /server/yarn/nm/usercache/yarn/appcache/application_1596002540781_1159/flink-io-1ad6bdc6-aea8-4dc5-a133-7c7b5e2361fe/job_1299f2f27e56ec36a4e0ffd3472ad399_op_AggregateWindowOperator_fa157648fdadffa65122f5b4200f4fda__1_1__uuid_9744ef17-bf12-471c-b486-19140201517f/db/038968.sst: Too many open files
    at org.rocksdb.Checkpoint.createCheckpoint(Native Method) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.rocksdb.Checkpoint.createCheckpoint(Checkpoint.java:51) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.takeDBNativeCheckpoint(RocksIncrementalSnapshotStrategy.java:255) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.doSnapshot(RocksIncrementalSnapshotStrategy.java:159) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.contrib.streaming.state.snapshot.RocksDBSnapshotStrategyBase.snapshot(RocksDBSnapshotStrategyBase.java:126) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.snapshot(RocksDBKeyedStateBackend.java:459) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:198) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:156) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:314) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:614) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:540) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:507) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:266) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:926) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:916) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:884) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
    ... 13 more
2021-02-01 08:54:52,654 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task fa157648fdadffa65122f5b4200f4fda_0.
2021-02-01 08:54:52,654 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 7 tasks should be restarted to recover the failed task fa157648fdadffa65122f5b4200f4fda_0.
2021-02-01 08:54:52,654 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job insert-into_default_catalog.default_database.risk_final_accept_sink,default_catalog.default_database.risk_final_accept_grafana_sink (1299f2f27e56ec36a4e0ffd3472ad399) switched from state RUNNING to RESTARTING.
2021-02-01 08:54:52,654 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime, 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime], select=[COUNT(DISTINCT merchantReferenceCode) AS acceptCount, start('w$) AS w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) -> Calc(select=[_UTF-16LE'risk_final_accept_hop10min30min' AS eventCode, (w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss') AS timeStart, (w$end DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss') AS timeEnd, (UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) * 1000) AS requestTime, _UTF-16LE'0' AS userId, acceptCount]) (1/1) (52f55328f6bf756dd1c63bb0d149e55b) switched from RUNNING to CANCELING.

--
Sent from: http://apache-flink.147419.n8.nabble.com/
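The innermost cause in the trace above is "Too many open files", raised while RocksDB was creating its native checkpoint. One way to check whether the TaskManager process is actually near its descriptor limit is sketched below. It is only a rough diagnostic, assuming a Linux host with /proc and a user allowed to read the target process's entries; the function names are made up for illustration and are not part of Flink.

```python
import os


def parse_max_open_files(limits_text):
    """Return the (soft, hard) 'Max open files' limits as strings.

    limits_text is the content of /proc/<pid>/limits. A value can be the
    literal string 'unlimited', so nothing is converted to int here.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return soft, hard
    raise LookupError("no 'Max open files' row found")


def open_fd_count(pid):
    """Count the descriptors a process currently holds (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def report(pid):
    """Print descriptors in use next to the limits for one process."""
    with open(f"/proc/{pid}/limits") as limits_file:
        soft, hard = parse_max_open_files(limits_file.read())
    print(f"pid {pid}: {open_fd_count(pid)} open, "
          f"soft limit {soft}, hard limit {hard}")
```

Calling `report(pid)` with the PID of the TaskManager JVM on the affected host (run as the same user as the YARN container, or as root) would show how close the open count is to the soft limit.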
Hi
Which Flink version are you running, and what are your job's checkpoint/state settings? If possible, please also send the complete JM log.

Best,
Congxian

chen310 <[hidden email]> wrote on Mon, Feb 1, 2021 at 5:41 PM:
> To add, here is the exception from the JobManager log:
> [...]
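If the complete JM log is too large to post, a summary of the innermost "Caused by:" lines may be easier to share. The following is a minimal sketch, assuming each exception header fits on one line and each stack frame starts with "at " or "..." after trimming; the helper names and the log file path are hypothetical.

```python
from collections import Counter


def root_causes(lines):
    """Yield the innermost 'Caused by:' line of each exception chain.

    A chain is treated as finished at the first line that is neither a
    stack frame ('at ...' or '... N more') nor a 'Caused by:' line.
    """
    last_cause = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Caused by:"):
            last_cause = line          # a deeper cause replaces the previous one
        elif line.startswith("at ") or line.startswith("..."):
            continue                   # stack frame, still inside the chain
        elif last_cause is not None:
            yield last_cause           # chain ended, report its innermost cause
            last_cause = None
    if last_cause is not None:
        yield last_cause               # chain that runs to the end of the input


def summarize(path):
    """Count innermost causes in a log file, reading it line by line."""
    with open(path, errors="replace") as log_file:
        return Counter(root_causes(log_file))
```

For example, `summarize("jobmanager.log").most_common(5)` (with a placeholder file name) would list the five most frequent innermost causes without loading the whole file into memory.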
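The Flink version is 1.12-SNAPSHOT. This build was packaged shortly after 1.11 was released so that we could use a new 1.12 feature, so the code should be fairly close to 1.11. The checkpoint configuration is: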
pipeline.time-characteristic EventTime
execution.checkpointing.interval 600000
execution.checkpointing.min-pause 120000
execution.checkpointing.timeout 120000
execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
state.backend rocksdb
state.backend.incremental true
state.checkpoints.dir hdfs:///tmp/flink/checkpoint

The complete JM log is very large (over 1 GB); what I pasted above is the key error information.
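To make the posted timings easier to read, the settings can be parsed and the durations converted to minutes. This is a small sketch under the assumption that the bare numbers are milliseconds; the helper names are illustrative only.

```python
# Checkpoint-related settings as posted in the thread: key, then value.
POSTED_CONFIG = """\
pipeline.time-characteristic EventTime
execution.checkpointing.interval 600000
execution.checkpointing.min-pause 120000
execution.checkpointing.timeout 120000
execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
state.backend rocksdb
state.backend.incremental true
state.checkpoints.dir hdfs:///tmp/flink/checkpoint
"""


def parse_config(text):
    """Split each non-empty line into a key and a value at the first space."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = line.split(None, 1)
        config[key] = value
    return config


def millis_to_minutes(value):
    """Convert a millisecond count (assumed unit) to minutes."""
    return int(value) / 60_000


config = parse_config(POSTED_CONFIG)
for key in (
    "execution.checkpointing.interval",
    "execution.checkpointing.min-pause",
    "execution.checkpointing.timeout",
):
    print(f"{key}: {millis_to_minutes(config[key]):g} min")
```

Under that assumption, a checkpoint is triggered every 10 minutes, with at least 2 minutes between checkpoints and a 2-minute timeout, using the RocksDB backend with incremental checkpoints.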