Hi all,
We have a fairly simple Flink job; after operator chaining, the execution graph is roughly:

Source: Custom Source -> Map -> Source_Map -> Empty_Filer -> Field_Filter -> Type_Filter -> Value_Filter -> Map -> Map -> Map -> Sink: Unnamed

After running in production for a while, it started to fail. The logs say checkpoints could not be completed, and they also mention Kafka network and connection exceptions. However, other Flink jobs read from and write to the same brokers without any errors. Our tentative theory is that each checkpoint takes fairly long to complete (several hundred milliseconds), while the checkpoint interval we configured is quite short (only one second), and that this may be hurting the job. But this is only a shaky guess, and we have no concrete starting point for troubleshooting, so we would appreciate any thoughts or suggestions. Many thanks.

Part of the error log:

2020-06-10 12:02:49,083 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1 @ 1591761769060 for job c41f4811262db1c4c270b136571c8201.
2020-06-10 12:04:47,898 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline checkpoint 1 by task 0cb03590fdf18027206ef628b3ef5863 of job c41f4811262db1c4c270b136571c8201 at container_e27_1591466310139_21670_01_000006 @ hdp1-hadoop-datanode-4.novalocal (dataPort=44778).
2020-06-10 12:04:47,899 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 1 of job c41f4811262db1c4c270b136571c8201.
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 1 for operator Source: Custom Source -> Map -> Source_Map -> Empty_Filer -> Field_Filter -> Type_Filter -> Value_Filter -> Map -> Map -> Map -> Sink: Unnamed (7/12). Failure reason: Checkpoint was declined.
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:434)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1420)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1354)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:991)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:887)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:860)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:793)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:777)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: The server disconnected before a response was received.
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1218)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:973)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:892)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:98)
    at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.snapshotState(TwoPhaseCommitSinkFunction.java:317)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.snapshotState(FlinkKafkaProducer.java:978)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:90)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:402)
    ... 18 more
Caused by: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
2020-06-10 12:04:47,913 INFO org.apache.flink.runtime.jobmaster.JobMaster - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:87)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpointDueToTaskFailure(CheckpointCoordinator.java:1467)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.discardCheckpoint(CheckpointCoordinator.java:1377)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:719)
    at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$declineCheckpoint$5(SchedulerBase.java:807)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Looking forward to your replies and help.

Best,
Zhefu
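For readers following along: the checkpoint settings discussed above live on the StreamExecutionEnvironment. The sketch below shows roughly how a 1-second interval like the one described would be configured and, commented out, the knobs one would loosen (interval, minimum pause, timeout, tolerable failures). The class name, variable names, and values are illustrative assumptions, not taken from the actual job.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSettingsSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // A 1-second interval, as described above, is aggressive when a single
            // checkpoint already takes a few hundred milliseconds to complete.
            env.enableCheckpointing(1_000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig checkpointConfig = env.getCheckpointConfig();
            // Possible relaxations (values are illustrative, not recommendations):
            // checkpointConfig.setMinPauseBetweenCheckpoints(5_000L);   // guaranteed gap between checkpoints
            // checkpointConfig.setCheckpointTimeout(120_000L);          // abort a slow checkpoint instead of letting it hang
            // checkpointConfig.setTolerableCheckpointFailureNumber(3);  // the "tolerable failure threshold" seen in the log above

            // ... build the Source -> Map/Filter chain -> Kafka sink here, then env.execute() ...
        }
    }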
A quick follow-up: I found the following error in the TaskManager logs:
2020-06-10 12:44:40,688 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during disposal of stream operator.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Pending record count must be zero at this point: 5
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1218)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:861)
    at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:668)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:579)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:481)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Pending record count must be zero at this point: 5
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:969)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.close(FlinkKafkaProducer.java:834)
    ... 8 more

Hoping for some help, thanks!

Zhefu PENG <[hidden email]> wrote on Wed, Jun 10, 2020 at 1:03 PM:
> [...]
Hi, from the checkpoint failures I have run into myself, they are usually caused by problematic data that makes an operator fail: things like a bad data format, mismatched field types, or a wrong number of fields. Judging from what you added, it looks like your Kafka data may have a problem, so you could look in that direction and check whether the data is well-formed and whether it parses correctly.
> On Jun 10, 2020, at 1:24 PM, Zhefu PENG <[hidden email]> wrote:
> [...]
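As an aside on the data-quality angle above: one defensive pattern is to drop records that fail a sanity check before they reach downstream operators, instead of letting a malformed message throw and take the checkpoint down with it. A minimal sketch, assuming the stream carries delimited strings; the class name, expected field count, and delimiter are assumptions made up for the example:

    import org.apache.flink.api.common.functions.FilterFunction;

    // Illustrative only: drop records that do not look well-formed rather than
    // letting them fail an operator further down the chain.
    public class MalformedRecordFilter implements FilterFunction<String> {
        private static final int EXPECTED_FIELDS = 5;   // assumed field count

        @Override
        public boolean filter(String value) {
            if (value == null || value.isEmpty()) {
                return false;                           // drop empty messages
            }
            String[] fields = value.split(",", -1);     // -1 keeps trailing empty fields
            return fields.length == EXPECTED_FIELDS;    // wrong field count -> drop, don't throw
        }
    }

Such a filter would sit in front of the existing maps, e.g. stream.filter(new MalformedRecordFilter()), assuming the records really are raw strings.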
Hi
From the stack trace, the snapshot failed on the task side; the cause is "Caused by: java.lang.IllegalStateException: Pending record count must be zero at this point: 5". You need to look into why execution ends up there.

Best,
Congxian

李奇 <[hidden email]> wrote on Wed, Jun 10, 2020 at 5:57 PM:
> [...]
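To make the assertion Congxian points at easier to picture, here is a simplified, self-contained model of the bookkeeping implied by the two error messages in this thread. It is not the Flink source, just a sketch of the pattern: each send increments a pending counter, the acknowledgement callback decrements it and remembers any failure, and the flush done before committing insists the counter is back at zero with no stored error.

    import java.util.Properties;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Simplified model (NOT the actual Flink source) of the pending-record check
    // behind "Pending record count must be zero at this point".
    public class PendingRecordsModel {
        private final KafkaProducer<byte[], byte[]> producer;
        private final AtomicLong pendingRecords = new AtomicLong();
        private volatile Exception asyncException;

        public PendingRecordsModel(Properties kafkaProps) {
            this.producer = new KafkaProducer<>(kafkaProps);
        }

        public void send(ProducerRecord<byte[], byte[]> record) {
            pendingRecords.incrementAndGet();
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    asyncException = exception;   // e.g. a NetworkException from the broker
                }
                pendingRecords.decrementAndGet();
            });
        }

        // In the real sink this corresponds to the flush performed at checkpoint
        // pre-commit time and at close(); both call sites appear in the stack traces above.
        public void flushAndCheck() {
            producer.flush();                     // wait for in-flight sends to complete or fail
            long pending = pendingRecords.get();
            if (pending != 0) {
                throw new IllegalStateException("Pending record count must be zero at this point: " + pending);
            }
            if (asyncException != null) {
                throw new RuntimeException("Failed to send data to Kafka", asyncException);
            }
        }
    }

In this model, a broker disconnect surfaces as a stored asyncException (the "Failed to send data to Kafka: ..." wrapper), while closing or snapshotting before all callbacks have fired is what can leave the pending count above zero, as in the TaskManager log.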
Hi ZheFu,
Could you tell us your Flink version? My rough understanding is this: each time the sink side runs snapshotState, it checks whether all the data produced for that sink has actually been sent to Kafka. In other words, at checkpoint time, because your checkpoint interval is rather short, Kafka has not yet finished acknowledging the in-flight records, hence this problem. I would suggest using a longer checkpoint interval.

For the details, see the method FlinkKafkaProducerBase.snapshotState.

Best,
LakeShen

Congxian Qiu <[hidden email]> wrote on Thu, Jun 11, 2020 at 9:50 AM:
> [...]
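Complementing the longer checkpoint interval suggested above, the producer handed to the Kafka sink can also be given more room to ride out short broker hiccups before a send is reported as failed. A minimal sketch of such settings, assuming a reasonably recent Kafka client; the class name, broker addresses, and values are assumptions, not tuned recommendations:

    import java.util.Properties;

    public final class ProducerPropsSketch {
        // Standard Kafka producer client configs; the values below are illustrative.
        public static Properties kafkaProducerProps() {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092");   // placeholder addresses
            props.setProperty("retries", "5");                    // retry transient send failures instead of failing fast
            props.setProperty("request.timeout.ms", "60000");     // how long to wait for a broker response
            props.setProperty("delivery.timeout.ms", "180000");   // total budget per record, including retries
            props.setProperty("transaction.timeout.ms", "900000"); // only matters if the sink uses EXACTLY_ONCE semantics
            return props;
        }
    }

The resulting Properties object is what gets passed to the FlinkKafkaProducer constructor alongside the target topic and the serialization schema.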
Hi all,
It has been a month since this thread was first started. Over that month we tried many of the suggestions from friends and from the experts here, and after a weekend plus two workdays of observation, the problem appears to be solved.

Root cause: the Kafka cluster did not have enough capacity (we suspect the CPUs were overloaded). When the problem occurred, the production Kafka cluster had only seven machines. After ruling out every other cause and exhausting the approaches we could try, we decided to scale the cluster out to 15 machines. Since then the job has been running smoothly and the error has not reappeared.

Sharing this as feedback so that anyone who hits a similar problem has a reference, and to close the loop on this issue. Thanks everyone for your attention and help.

Best,
Zhefu

LakeShen <[hidden email]> wrote on Fri, Jun 12, 2020 at 9:49 AM:
> [...]
> Sharing this as feedback so that anyone who hits a similar problem has a reference, and to close the loop on this issue. Thanks everyone for your attention and help.
>
> Best,
> Zhefu

Thanks Zhefu, a big thumbs up. This is very much the community way: the more write-ups like this we accumulate, the more everyone here can learn.

Best,
Leonard Xu

> LakeShen <[hidden email]> wrote on Fri, Jun 12, 2020 at 9:49 AM:
>> [...]
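Similarly, to illustrate 李奇's point about malformed records making an operator (and with it the checkpoint) fail, here is a minimal sketch of a defensive parse step that drops records it cannot handle instead of throwing. The expected field count and delimiter are made-up assumptions for the example, not details from the job in this thread.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

// Hypothetical validity check: keep only records with exactly three comma-separated fields.
public class SafeParseFunction implements FlatMapFunction<String, String> {
    private static final int EXPECTED_FIELDS = 3;

    @Override
    public void flatMap(String raw, Collector<String> out) {
        try {
            String[] fields = raw.split(",", -1);
            if (fields.length != EXPECTED_FIELDS) {
                // Malformed record: drop it (or route it to a side output / dead-letter topic).
                return;
            }
            out.collect(raw);
        } catch (RuntimeException e) {
            // Never let a single bad record fail the whole operator.
        }
    }
}

It would sit right after the source, e.g. stream.flatMap(new SafeParseFunction()), before the filter and map operators in the job graph.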
Hi Zhefu
Thanks for sharing your solution on the mailing list; it gives everyone who hits a similar problem a reference.

Best,
Congxian

Zhefu PENG <[hidden email]> wrote on Mon, Jul 13, 2020 at 7:51 PM:

> Hi all,
>
> It has been a month since this thread was first started. Over that month we tried many of the suggestions from the list, and after a weekend plus two working days of observation the problem appears to be solved.
>
> Root cause: the Kafka cluster did not have enough capacity (we suspect the CPUs were overloaded). When the problem appeared, the production Kafka cluster had only seven brokers. After ruling out every other cause and exhausting the options we could still try, we decided to scale the cluster out to 15 brokers. Since then the job has been running smoothly and no similar error has been reported.
>
> Just reporting back so that anyone who runs into a similar problem has a reference, and to close the loop on this issue. Thanks everyone for the attention and the help.
>
> Best,
> Zhefu
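As a final note, the Kafka producer client settings passed to the FlinkKafkaProducer can be made more forgiving, so that a temporarily overloaded broker does not immediately surface as a NetworkException while the sink flushes during the checkpoint. The sketch below assumes the universal FlinkKafkaProducer with its default at-least-once semantics; the broker addresses, topic name, and values are placeholders, and tuning them is a mitigation, not a substitute for the capacity fix described above.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaSinkSketch {
    public static FlinkKafkaProducer<String> buildSink() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder brokers

        // Standard Kafka producer client settings: wait longer for a busy broker
        // and retry transient send failures instead of failing the snapshot right away.
        props.setProperty("request.timeout.ms", "60000");
        props.setProperty("retries", "5");
        props.setProperty("retry.backoff.ms", "1000");

        // String sink to a placeholder topic; semantics default to at-least-once.
        return new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props);
    }
}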