The attachment is the exception stack from Flink's web UI. Has anyone
else met this problem? Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers, each with 28 GB of memory.
Hi Yidan,
it seems that the attachment did not make it through the mailing list. Can you copy-paste the text of the exception here or upload the log somewhere?
Hi, here is the exception stack as text:
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to '10.35.215.18/10.35.215.18:2045')
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
In reply to this post by nobleyd
Hi yidan,
1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle; see [1] for more information.
4. You may try to configure these two options: taskmanager.network.retries and taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].
5. If it is none of the above cases, it may be related to [3]; you may need to check the number of TCP connections per TM and node.

Hope this helps.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
[3] https://issues.apache.org/jira/browse/FLINK-22643

Best,
Yingjie
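(For reference, a minimal flink-conf.yaml sketch of the two options in point 4 might look like the following; the concrete values are only illustrative assumptions, not recommendations made in this thread:)

    # Hypothetical tuning sketch, not tested values.
    # Retry failed network connection attempts instead of failing immediately (default: 0).
    taskmanager.network.retries: 3
    # Netty client connect timeout in seconds (default: 120).
    taskmanager.network.netty.client.connectTimeoutSec: 300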
2: I use G1, and no full GC occurred; young GC count: 422, time: 142892, so it is not bad.
3: It is a stream job.
4: I will try to configure taskmanager.network.retries, whose default is 0; the default of taskmanager.network.netty.client.connectTimeoutSec is 120 s.
5: I checked the number of network fds of the TaskManager; it is about 1000+, so I think that is a reasonable value.

1: Cannot be sure.
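(As an aside, one way to sample such counts on Linux - where <taskmanager-pid> is a placeholder for the TaskManager process id - is something like:)

    # Count all open file descriptors of the TaskManager process.
    ls /proc/<taskmanager-pid>/fd | wc -l
    # Count only its open network sockets.
    lsof -a -p <taskmanager-pid> -i | wc -l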
Hi, yingjie.
If the network is not stable, which config parameters should I adjust?
I also searched for many results on the Internet. There are some related
exceptions such as org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException, but in my case it is org.apache.flink.runtime.io.network.netty.exception.LocalTransportException. The difference is 'LocalTransportException' versus 'RemoteTransportException'.
In reply to this post by nobleyd
Maybe you can try to
increase taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options are useful for our jobs.
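(Purely as an illustration, the two additional options could be set in flink-conf.yaml roughly as follows; the numbers are made-up starting points, not values suggested in this thread:)

    # Hypothetical sketch; for both options the default 0 means "use the system default".
    # Length of the Netty server's pending-connection queue.
    taskmanager.network.netty.server.backlog: 1024
    # Netty send/receive socket buffer size in bytes (here: 4 MiB).
    taskmanager.network.netty.sendReceiveBufferSize: 4194304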
Ok, I will try.
In reply to this post by nobleyd
Is this standalone on physical machines, or Docker/K8s?

When does this exception occur: periodically, or correlated with changes in CPU, memory, or even network traffic?
@东东 It is a standalone cluster. The exceptions happen at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory and network, but I cannot confirm it because it is not very obvious. I checked a few exceptions whose timestamps matched network spikes, but not all of them did, so it is hard to say whether that is the cause.

Besides, there is one point I am not clear about: there are very few reports of this error on the Internet. Similar reports are about RemoteTransportException, whose message hints that the TaskManager may have been lost. But mine is LocalTransportException, and I am not sure whether these two errors have different meanings in Netty. So far I cannot find much material about either exception online.
Is 10.35.215.18 the host machine's IP?

Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps. If that does not help, fall back to tcpdump.
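(On a typical Linux host these can be read with, for example:)

    # Print the current values; note the tcp_tw_recycle key only exists on older kernels.
    sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_timestamps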
Yes, it is the host machine's IP.

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1
Set one of them to 0.
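(As an illustration only - which knob to change should follow the advice in this thread - turning one of them off at runtime could look like:)

    # Disable TCP timestamps immediately (not persistent across reboots).
    sysctl -w net.ipv4.tcp_timestamps=0
    # To persist it, add "net.ipv4.tcp_timestamps = 0" to /etc/sysctl.conf and run: sysctl -p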
What is the reasoning behind this? I cannot make this change directly; it needs to go through an approval process.
If both of these are enabled, connection requests from the same source IP must carry monotonically increasing timestamps; otherwise the (non-increasing) connection requests are treated as invalid and the packets are dropped, which on the client side looks like occasional connection timeouts.

Normally a single machine does not hit this problem, because there is only one clock; it tends to show up behind NAT (since the clocks of multiple hosts are usually not exactly in sync). I do not know your exact architecture, so all I can say is give it a try.

Finally, you can discuss this with your ops team: unless you are sure that no connections come in through NAT, it is best not to have both of these enabled.

PS: kernel 4.1 already removed tcp_tw_reuse, because too many people fell into this trap.
I thought it over carefully. My cluster runs in containers on internal servers, and traffic between the containers should not go through NAT.

That said, judging from the network-related monitoring, many machines do have quite a lot of connections in TIME_WAIT state, around 50,000+, but I do not feel that alone would cause this problem.
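(For reference, such a TIME_WAIT count can be sampled on Linux with something like:)

    # Count sockets currently in TIME_WAIT; the first output line is a header.
    ss -tan state time-wait | wc -l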