2020-11-18 16:51:37
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2 not found. at org.apache.flink.runtime.io.network.partition.consumer. RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267) at org.apache.flink.runtime.io.network.partition.consumer. RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166) at org.apache.flink.runtime.io.network.partition.consumer. SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521) at org.apache.flink.runtime.io.network.partition.consumer. SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765 ) at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture .java:670) at java.util.concurrent.CompletableFuture$UniAccept.tryFire( CompletableFuture.java:646) at java.util.concurrent.CompletableFuture$Completion.run( CompletableFuture.java:456) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec( ForkJoinExecutorConfigurator.scala:44) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool .java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread .java:107) 请问这是什么问题呢? |
感觉还有其它 root cause,可以看下还有其它日志不?
Best, Hailong At 2020-11-18 15:52:57, "赵一旦" <[hidden email]> wrote: >2020-11-18 16:51:37 >org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: >Partition b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2 >not found. > at org.apache.flink.runtime.io.network.partition.consumer. >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267) > at org.apache.flink.runtime.io.network.partition.consumer. >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166) > at org.apache.flink.runtime.io.network.partition.consumer. >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521) > at org.apache.flink.runtime.io.network.partition.consumer. >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765 >) > at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture >.java:670) > at java.util.concurrent.CompletableFuture$UniAccept.tryFire( >CompletableFuture.java:646) > at java.util.concurrent.CompletableFuture$Completion.run( >CompletableFuture.java:456) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec( >ForkJoinExecutorConfigurator.scala:44) > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool >.java:1339) > at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread >.java:107) > > >请问这是什么问题呢? |
是不是有 kafka 机器挂了?
Best zhisheng hailongwang <[hidden email]> 于2020年11月18日周三 下午5:56写道: > 感觉还有其它 root cause,可以看下还有其它日志不? > > > Best, > Hailong > > At 2020-11-18 15:52:57, "赵一旦" <[hidden email]> wrote: > >2020-11-18 16:51:37 > >org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: > >Partition > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2 > >not found. > > at org.apache.flink.runtime.io.network.partition.consumer. > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267) > > at org.apache.flink.runtime.io.network.partition.consumer. > > >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166) > > at org.apache.flink.runtime.io.network.partition.consumer. > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521) > > at org.apache.flink.runtime.io.network.partition.consumer. > > >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765 > >) > > at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture > >.java:670) > > at java.util.concurrent.CompletableFuture$UniAccept.tryFire( > >CompletableFuture.java:646) > > at java.util.concurrent.CompletableFuture$Completion.run( > >CompletableFuture.java:456) > > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > > at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec( > >ForkJoinExecutorConfigurator.scala:44) > > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool > >.java:1339) > > at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread > >.java:107) > > > > > >请问这是什么问题呢? > |
这个报错和kafka没有关系的哈,我大概理解是提交任务的瞬间,jobManager/taskManager机器压力较大,存在机器之间心跳超时什么的?
这个partition应该是指flink运行图中的数据partition,我感觉。没有具体细看,每次提交的瞬间可能遇到这个问题,然后会自动重试成功。 zhisheng <[hidden email]> 于2020年11月18日周三 下午10:51写道: > 是不是有 kafka 机器挂了? > > Best > zhisheng > > hailongwang <[hidden email]> 于2020年11月18日周三 下午5:56写道: > > > 感觉还有其它 root cause,可以看下还有其它日志不? > > > > > > Best, > > Hailong > > > > At 2020-11-18 15:52:57, "赵一旦" <[hidden email]> wrote: > > >2020-11-18 16:51:37 > > >org.apache.flink.runtime.io > .network.partition.PartitionNotFoundException: > > >Partition > > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2 > > >not found. > > > at org.apache.flink.runtime.io.network.partition.consumer. > > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765 > > >) > > > at > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture > > >.java:670) > > > at java.util.concurrent.CompletableFuture$UniAccept.tryFire( > > >CompletableFuture.java:646) > > > at java.util.concurrent.CompletableFuture$Completion.run( > > >CompletableFuture.java:456) > > > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > > > at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec( > > >ForkJoinExecutorConfigurator.scala:44) > > > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool > > >.java:1339) > > > at > > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > > at > > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread > > >.java:107) > > > > > > > > >请问这是什么问题呢? > > > |
Hi
集群负载比较大的时候,下游一直收不到request的partition,就会导致PartitionNotFoundException,建议增大 taskmanager.network.request-backoff.max [1][2] 以增大重试次数 [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-request-backoff-max [2] https://juejin.cn/post/6844904185347964942#heading-8 祝好 唐云 ________________________________ From: 赵一旦 <[hidden email]> Sent: Monday, November 23, 2020 13:08 To: [hidden email] <[hidden email]> Subject: Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。 这个报错和kafka没有关系的哈,我大概理解是提交任务的瞬间,jobManager/taskManager机器压力较大,存在机器之间心跳超时什么的? 这个partition应该是指flink运行图中的数据partition,我感觉。没有具体细看,每次提交的瞬间可能遇到这个问题,然后会自动重试成功。 zhisheng <[hidden email]> 于2020年11月18日周三 下午10:51写道: > 是不是有 kafka 机器挂了? > > Best > zhisheng > > hailongwang <[hidden email]> 于2020年11月18日周三 下午5:56写道: > > > 感觉还有其它 root cause,可以看下还有其它日志不? > > > > > > Best, > > Hailong > > > > At 2020-11-18 15:52:57, "赵一旦" <[hidden email]> wrote: > > >2020-11-18 16:51:37 > > >org.apache.flink.runtime.io > .network.partition.PartitionNotFoundException: > > >Partition > > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2 > > >not found. > > > at org.apache.flink.runtime.io.network.partition.consumer. > > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521) > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765 > > >) > > > at > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture > > >.java:670) > > > at java.util.concurrent.CompletableFuture$UniAccept.tryFire( > > >CompletableFuture.java:646) > > > at java.util.concurrent.CompletableFuture$Completion.run( > > >CompletableFuture.java:456) > > > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > > > at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec( > > >ForkJoinExecutorConfigurator.scala:44) > > > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > > at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool > > >.java:1339) > > > at > > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > > at > > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread > > >.java:107) > > > > > > > > >请问这是什么问题呢? > > > |
@Yun Tang,应该是这个问题。
请教下这几个参数具体含义。 backoff in milliseconds for partition requests of input channels 是什么逻辑,以及initial和max分别表达含义。 akka.ask.timeout这个参数相对明显,就是超时,这个以前也涉及过,在cancel/submit/savepoint等情况都可能导致集群slot陆续没掉,然后再陆续回来(pass环境,基本就是部分机器失联,然后重新连接的case)。 Yun Tang <[hidden email]> 于2020年11月23日周一 下午5:11写道: > Hi > > 集群负载比较大的时候,下游一直收不到request的partition,就会导致PartitionNotFoundException,建议增大 > taskmanager.network.request-backoff.max [1][2] 以增大重试次数 > > [1] > https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-request-backoff-max > [2] https://juejin.cn/post/6844904185347964942#heading-8 > > > 祝好 > 唐云 > ________________________________ > From: 赵一旦 <[hidden email]> > Sent: Monday, November 23, 2020 13:08 > To: [hidden email] <[hidden email]> > Subject: Re: Flink任务启动偶尔报错PartitionNotFoundException,会自动恢复。 > > 这个报错和kafka没有关系的哈,我大概理解是提交任务的瞬间,jobManager/taskManager机器压力较大,存在机器之间心跳超时什么的? > 这个partition应该是指flink运行图中的数据partition,我感觉。没有具体细看,每次提交的瞬间可能遇到这个问题,然后会自动重试成功。 > > zhisheng <[hidden email]> 于2020年11月18日周三 下午10:51写道: > > > 是不是有 kafka 机器挂了? > > > > Best > > zhisheng > > > > hailongwang <[hidden email]> 于2020年11月18日周三 下午5:56写道: > > > > > 感觉还有其它 root cause,可以看下还有其它日志不? > > > > > > > > > Best, > > > Hailong > > > > > > At 2020-11-18 15:52:57, "赵一旦" <[hidden email]> wrote: > > > >2020-11-18 16:51:37 > > > >org.apache.flink.runtime.io > > .network.partition.PartitionNotFoundException: > > > >Partition > > > b225fa9143dfa179d3a3bd223165d5c5#3@3fee4d51f5a43001ef743f3f15e4cfb2 > > > >not found. > > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > >RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267) > > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > > > > > >RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166) > > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > >SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521) > > > > at org.apache.flink.runtime.io.network.partition.consumer. > > > > > > > > > >SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765 > > > >) > > > > at > > java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture > > > >.java:670) > > > > at java.util.concurrent.CompletableFuture$UniAccept.tryFire( > > > >CompletableFuture.java:646) > > > > at java.util.concurrent.CompletableFuture$Completion.run( > > > >CompletableFuture.java:456) > > > > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > > > > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec( > > > >ForkJoinExecutorConfigurator.scala:44) > > > > at > akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > > > at > > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool > > > >.java:1339) > > > > at > > > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > > > at > > > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread > > > >.java:107) > > > > > > > > > > > >请问这是什么问题呢? > > > > > > |
Free forum by Nabble | Edit this page |