Hi all,

Cluster information: Flink version 1.10.1, deployed on Kubernetes.

Symptom: the job needs 200 slots. With 40 TMs and 4 slots per TM, the job runs normally. With 200 TMs and 1 slot per TM, the cluster comes up normally and the UI shows 200 Available Task Slots, but submitting the job fails with the following error:

Caused by: java.net.NoRouteToHostException: No route to host

That is the scenario I am running into. Any replies or explanations would be much appreciated.

Looking forward to your reply and help.

Best
a511955993
hi
It would be best to post the complete logs and the full error stack. This error is usually caused by a lack of network connectivity between the machines/pods that the TMs run on. It may be related to the Kubernetes configuration, but that is hard to determine from the information available so far.

Thank you~
Xintong Song
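One way to narrow down a connectivity problem like this is to open a plain TCP connection from inside one TM pod to another TM's address and port, and look at which exception the JVM raises. The following is a minimal sketch and not part of Flink: the class name `PodReachability`, the category labels, and the 3-second timeout are illustrative assumptions, and the host and port have to be supplied by whoever runs it.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class PodReachability {

    /** Maps a connect failure to a coarse category (labels are illustrative). */
    static String classify(IOException e) {
        if (e instanceof NoRouteToHostException) {
            return "NO_ROUTE";        // the network reported the host as unreachable
        }
        if (e instanceof ConnectException) {
            return "CONNECT_FAILED";  // e.g. connection refused by the remote side
        }
        if (e instanceof SocketTimeoutException) {
            return "TIMEOUT";         // no answer within the timeout passed to connect()
        }
        return "IO_ERROR";
    }

    /** Attempts one TCP connection and returns "REACHABLE" or a failure category. */
    static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "REACHABLE";
        } catch (IOException e) {
            return classify(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java PodReachability <host> <port>");
            return;
        }
        // 3000 ms is an arbitrary timeout chosen for this sketch.
        System.out.println(probe(args[0], Integer.parseInt(args[1]), 3000));
    }
}
```

Running it from several TM pods against the same target can help tell a problem on one pod or node apart from a cluster-wide one.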
In reply to this post by Chris Guo
Hi Xintong, the stack trace is as follows.
2020-05-20 16:46:20
org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition 66c378b86c3e100e4a2d34927c4b7281@bb397f70ad4474d2beac18d484d726af not reachable.
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:240)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:218)
    at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
    at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:864)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:624)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + '/10.45.128.4:35285' has failed. This might indicate that the remote task manager has been lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:134)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:86)
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67)
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:165)
    ... 7 more
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + '/10.45.128.4:35285' has failed. This might indicate that the remote task manager has been lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:134)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    ... 1 more
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.45.128.4:35285
Caused by: java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)

a511955993
Hi,
Judging from the log, the root cause of the error is that a TM went down, so its pod was removed and the other TMs could no longer find the address of the failed TM. Please check whether any TM crashed or restarted when the error occurred.

As for why the TM went down, you will need to find a way to get the logs of the failed TM. From your earlier description, the cluster starts without problems and the issue only appears when the job runs. My current suspicion is that a resource problem caused by job execution led the TM to hit an OOM, or to be killed by Kubernetes for exceeding its memory limit. While changing the number of TMs and slots, did you also adjust the TM resource sizes? Even if you did not, the resources consumed by the job itself will change. For example, more TMs means each TM has to establish more network connections, which consumes more memory. The details still have to be analyzed from the logs.

Thank you~
Xintong Song
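To put rough numbers on the point about connection counts, the sketch below compares the two layouts from this thread (40 TMs and 200 TMs). It assumes a job with an all-to-all exchange in which every TM ends up talking to every other TM, and it counts TM pairs rather than actual sockets, since the exact number of TCP connections depends on the job graph and on how Flink shares connections. The class and method names are hypothetical.

```java
public class TmConnectionEstimate {

    /** Upper bound on the number of remote TMs a single TM may exchange data with. */
    static int peersPerTm(int taskManagers) {
        return Math.max(0, taskManagers - 1);
    }

    /** Upper bound on the number of distinct (unordered) TM pairs in the cluster. */
    static long tmPairs(int taskManagers) {
        return (long) taskManagers * (taskManagers - 1) / 2;
    }

    public static void main(String[] args) {
        // The two cluster layouts described in the original post.
        for (int tms : new int[] {40, 200}) {
            System.out.printf("%d TMs: up to %d peers per TM, up to %d TM pairs%n",
                    tms, peersPerTm(tms), tmPairs(tms));
        }
    }
}
```

Under these assumptions the per-TM peer count grows from 39 to 199, and the cluster-wide pair count from 780 to 19900, so going from 40 to 200 TMs multiplies the pair count by roughly 25 even though the total number of slots stays at 200.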
Hi Xintong,

What I observed: the system logs contain no record of a kernel OOM kill. After the job is cancelled, the TMs that lost contact reconnect, and their pods were not killed. My initial suspicion is a network-level problem; it looks like the CNI imposes some limit.

Thanks~
a511955993
Could it be that the pod IPs ran out, or that the iptables on the pods limit the number of entries, or something along those lines?

Thank you~
Xintong Song
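As a rough way to check the "not enough pod IPs" hypothesis, the arithmetic below compares the size of an IPv4 pod address range with the number of TM pods required. Everything here is hypothetical: the thread does not state the cluster's pod CIDR, so the `/24` and `/25` prefixes and the two reserved addresses are placeholder values, and the class name `PodIpCapacity` is made up for this sketch. The real range has to be read from the cluster's CNI configuration.

```java
public class PodIpCapacity {

    /** Number of IPv4 addresses in a CIDR block with the given prefix length. */
    static long addressesInPrefix(int prefixLength) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("prefix length must be in [0, 32]");
        }
        return 1L << (32 - prefixLength);
    }

    /** True if the block can hold podsNeeded pods after setting aside reserved addresses. */
    static boolean fits(int prefixLength, int reservedAddresses, int podsNeeded) {
        return addressesInPrefix(prefixLength) - reservedAddresses >= podsNeeded;
    }

    public static void main(String[] args) {
        int podsNeeded = 200;       // TM count from the failing setup in this thread
        int reservedAddresses = 2;  // placeholder: e.g. network and broadcast addresses
        for (int prefix : new int[] {24, 25}) {  // placeholder prefix lengths
            System.out.printf("/%d: %d addresses, fits %d pods: %b%n",
                    prefix, addressesInPrefix(prefix), podsNeeded,
                    fits(prefix, reservedAddresses, podsNeeded));
        }
    }
}
```

The same check would apply per node if the CNI plugin assigns each node its own pod range.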