Too many TMs, job fails to run

Too many TMs, job fails to run

Chris Guo

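Hi all,

Cluster information:
Flink version 1.10.1, deployed on Kubernetes.

Symptoms:
The job needs 200 slots. With 40 TMs and 4 slots per TM, the job runs normally. With 200 TMs and 1 slot per TM, the cluster comes up fine and the UI shows 200 Available Task Slots, but submitting the job produces the following error:

Caused by: java.net.NoRouteToHostException: No route to host

That is the scenario I am running into. Any replies or explanations would be much appreciated.

Looking forward to your reply and help.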

Best

a511955993

Re: Too many TMs, job fails to run

Xintong Song
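Hi,

It would be best to send the complete logs and the error stack.
This error is usually caused by the network being unreachable between the machines/pods the TMs run on. It may be related to the Kubernetes configuration, but that is hard to determine from the information so far.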
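
One way to check the "network unreachable between pods" hypothesis is to attempt a plain TCP connect from one TM pod to another and see how it fails. The following is a minimal sketch, not something taken from this thread: the function names `probe` and `classify_connect_error` are made up for illustration, it assumes a Python interpreter is available inside the pod, and it assumes the JVM's NoRouteToHostException corresponds to the OS-level EHOSTUNREACH error, which is the usual mapping on Linux.

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a failed TCP connect to a coarse category (hypothetical helper)."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    if exc.errno == errno.EHOSTUNREACH:
        # Assumed to be the OS error behind java.net.NoRouteToHostException.
        return "no_route_to_host"
    if exc.errno == errno.ECONNREFUSED:
        # The host answered, but nothing is listening on that port.
        return "connection_refused"
    return "other"


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a single TCP connect and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except OSError as exc:
        return classify_connect_error(exc)
```

Run from inside a TM pod against the data port of another TM, for example `probe("10.45.128.4", 35285)` with an address taken from the failing stack trace. A "no_route_to_host" result while the target pod is still running would point at the network layer rather than at a lost TaskManager.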

Thank you~

Xintong Song



On Wed, May 20, 2020 at 3:50 PM <[hidden email]> wrote:

> With 200 TMs and 1 slot per TM, the cluster comes up fine and the UI shows 200 Available Task Slots, but submitting the job produces the following error:
>
> Caused by: java.net.NoRouteToHostException: No route to host

Re: Too many TMs, job fails to run

Chris Guo
Hi Xintong, the stack trace is as follows.

2020-05-20 16:46:20
org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition 66c378b86c3e100e4a2d34927c4b7281@bb397f70ad4474d2beac18d484d726af not reachable.
 at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
 at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:240)
 at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:218)
 at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
 at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:864)
 at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:624)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + '/10.45.128.4:35285' has failed. This might indicate that the remote task manager has been lost.
 at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
 at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:134)
 at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:86)
 at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67)
 at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:165)
 ... 7 more
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + '/10.45.128.4:35285' has failed. This might indicate that the remote task manager has been lost.
 at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
 at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:134)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
 at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
 ... 1 more
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.45.128.4:35285
Caused by: java.net.NoRouteToHostException: No route to host
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
 at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
 at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
 at java.lang.Thread.run(Thread.java:748)
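
When reading a trace like the one above, the innermost "Caused by:" line is usually the most informative, together with the remote address that could not be reached. The sketch below shows one way to pull both out of a pasted trace. The helper names `cause_chain` and `remote_address` are hypothetical, and `SAMPLE` is an abbreviated copy of lines from the trace above.

```python
import re

# "Caused by: <exception class>[: <message>]"
CAUSE_RE = re.compile(r"^Caused by: (?P<cls>[\w.$]+)(?::\s*(?P<msg>.*))?$")
# "/<ipv4>:<port>" as printed by Netty for a remote address
ADDR_RE = re.compile(r"/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def cause_chain(trace: str):
    """Return (exception class, message) per 'Caused by:' line, outermost first."""
    chain = []
    for line in trace.splitlines():
        m = CAUSE_RE.match(line.strip())
        if m:
            chain.append((m.group("cls"), m.group("msg") or ""))
    return chain


def remote_address(trace: str):
    """Return the first '/ip:port' in the trace as (ip, port), or None."""
    m = ADDR_RE.search(trace)
    return (m.group(1), int(m.group(2))) if m else None


# Abbreviated excerpt of the trace posted in this message.
SAMPLE = "\n".join([
    "Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + '/10.45.128.4:35285' has failed. This might indicate that the remote task manager has been lost.",
    " ... 7 more",
    "Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.45.128.4:35285",
    "Caused by: java.net.NoRouteToHostException: No route to host",
    " at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)",
])

root_class, root_message = cause_chain(SAMPLE)[-1]
print(root_class, "|", root_message, "|", remote_address(SAMPLE))
```

For this trace the innermost cause is `java.net.NoRouteToHostException` and the unreachable peer is `10.45.128.4:35285`, which is the address to check against the list of running TM pods.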





Re: Too many TMs, job fails to run

Xintong Song
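Hi,

From the log, the root cause of the error is that a TM died, which caused its pod to be removed, so the other TMs could no longer find the dead TM's address. Please check whether any TM died or restarted when the error occurred.

As for why the TM died, you will need to find a way to get the logs of the failed TM. From your earlier description, the cluster starts without problems and the issue only appears when the job runs. My current suspicion is that resource pressure from running the job caused a TM to hit an OOM, or to be killed by Kubernetes for exceeding its memory limit. While changing the number of TMs and slots, did you also change the TM resource sizes? Even if you did not, the resources the job itself consumes will change. For example, with more TMs each TM has to establish more network connections, which consumes more memory. The details still need to be analysed from the logs.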
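
To put rough numbers on the point about network connections, the sketch below compares the two layouts from the original post. It rests on assumptions that are not stated in the thread: the job contains at least one all-to-all exchange (for example a keyBy or rebalance) at parallelism 200, so every TM may need to talk to every other TM. It counts peer TMs only; how many TCP connections Flink opens per TM pair is not modelled. The function names are made up for illustration.

```python
def peers_per_tm(num_tms: int) -> int:
    """Remote TMs that one TM may need to reach in an all-to-all exchange."""
    return max(num_tms - 1, 0)


def tm_pairs(num_tms: int) -> int:
    """Distinct TM-to-TM pairs in the whole cluster."""
    return num_tms * (num_tms - 1) // 2


# The two layouts from the original post: (number of TMs, slots per TM).
for num_tms, slots in [(40, 4), (200, 1)]:
    print(
        f"{num_tms} TMs x {slots} slots = {num_tms * slots} slots: "
        f"{peers_per_tm(num_tms)} peers per TM, {tm_pairs(num_tms)} TM pairs"
    )
```

Under these assumptions, going from 40 TMs to 200 TMs raises the peers per TM from 39 to 199 and the number of TM pairs in the cluster from 780 to 19900, even though the total slot count stays at 200.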

Thank you~

Xintong Song



On Wed, May 20, 2020 at 4:50 PM <[hidden email]> wrote:

> Hi Xintong, the stack trace is as follows.
>
> [...]
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.45.128.4:35285
> Caused by: java.net.NoRouteToHostException: No route to host

Re: Too many TMs, job fails to run

Chris Guo
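Hi Xintong,

From what I can observe, the system logs contain no record of a kernel OOM kill. After the job is cancelled, the TMs that lost contact reconnect, and the pods were not killed. My initial suspicion is a network-level problem; it feels like the CNI is imposing some limit.

thanks~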





On May 20, 2020 at 17:56, Xintong Song wrote:

> From the log, the root cause of the error is that a TM died, which caused its pod to be removed, so the other TMs could no longer find the dead TM's address. Please check whether any TM died or restarted when the error occurred.
> [...]

Re: Too many TMs, job fails to run

Xintong Song
Could it be that there are not enough pod IPs, or that the IP table on the pod limits the number of entries, or something along those lines?
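
The pod IP question can be sanity-checked with simple arithmetic once the pod address range is known. The sketch below is only an illustration: the CIDR values are hypothetical (chosen to contain the 10.45.128.4 address seen in the trace), the real range and whether it is allocated per node or per cluster depend on the CNI plugin, and the function names are made up.

```python
import ipaddress


def usable_ips(cidr: str, reserved: int = 0) -> int:
    """Addresses in a pod range that are left for pods.

    `reserved` is a hypothetical count of addresses already taken by
    gateways or other pods sharing the range.
    """
    net = ipaddress.ip_network(cidr)
    total = net.num_addresses
    if net.version == 4 and net.prefixlen <= 30:
        total -= 2  # network and broadcast addresses
    return max(total - reserved, 0)


def fits(cidr: str, pods_needed: int, reserved: int = 0) -> bool:
    """True if the range can hold the requested number of pods."""
    return pods_needed <= usable_ips(cidr, reserved)


# Hypothetical ranges; replace with the CIDR actually configured for pods.
for cidr in ["10.45.128.0/24", "10.45.128.0/25"]:
    print(cidr, usable_ips(cidr), fits(cidr, 40), fits(cidr, 200))
```

With these example ranges, a /24 leaves 254 addresses and could hold 200 TM pods if nothing else used it, while a /25 leaves 126 and could hold the 40-TM layout but not the 200-TM one.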

Thank you~

Xintong Song



On Wed, May 20, 2020 at 6:44 PM <[hidden email]> wrote:

> Hi Xintong,
>
> From what I can observe, the system logs contain no record of a kernel OOM kill. After the job is cancelled, the TMs that lost contact reconnect, and the pods were not killed. My initial suspicion is a network-level problem; it feels like the CNI is imposing some limit.
>
> thanks~