The attachment is the exception stack from Flink's web UI. Has anyone
else met this problem? Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers, each with 28 GB of memory.
Hi Yidan,
it seems that the attachment did not make it through the mailing list. Can you copy-paste the text of the exception here or upload the log somewhere?
Hi, here is the exception stack as text:
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to '10.35.215.18/10.35.215.18:2045')
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
In reply to this post by nobleyd
Hi yidan,
1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle; see [1] for more information.
4. You may try to configure these two options: taskmanager.network.retries and taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].
5. If it is none of the above cases, it may be related to [3]; you may need to check the number of TCP connections per TM and node.

Hope this helps.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
[3] https://issues.apache.org/jira/browse/FLINK-22643

Best,
Yingjie
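(For reference, a minimal flink-conf.yaml sketch of the two options in point 4 might look like the following; the concrete values are only illustrative assumptions, not recommendations made in this thread:)

    # Hypothetical tuning sketch, not tested values.
    # Retry failed network connection attempts instead of failing immediately (default: 0).
    taskmanager.network.retries: 3
    # Netty client connect timeout in seconds (default: 120).
    taskmanager.network.netty.client.connectTimeoutSec: 300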
2: I use G1, and no full GC occurred; young GC count: 422, time: 142892, so it is not bad.
3: It is a stream job.
4: I will try to configure taskmanager.network.retries, whose default is 0; the default of taskmanager.network.netty.client.connectTimeoutSec is 120 s.
5: I checked the number of network fds of the TaskManager; it is about 1000+, so I think that is a reasonable value.

1: Cannot be sure.
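(As an aside, one way to sample such counts on Linux - where <taskmanager-pid> is a placeholder for the TaskManager process id - is something like:)

    # Count all open file descriptors of the TaskManager process.
    ls /proc/<taskmanager-pid>/fd | wc -l
    # Count only its open network sockets.
    lsof -a -p <taskmanager-pid> -i | wc -l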
Hi, yingjie.
If the network is not stable, which config parameters should I adjust?
I also searched for many results on the Internet. There are some related
exceptions such as org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException, but in my case it is org.apache.flink.runtime.io.network.netty.exception.LocalTransportException. The difference is 'LocalTransportException' versus 'RemoteTransportException'.
In reply to this post by nobleyd
Maybe you can try to
increase taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options are useful for our jobs.
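(Purely as an illustration, the two additional options could be set in flink-conf.yaml roughly as follows; the numbers are made-up starting points, not values suggested in this thread:)

    # Hypothetical sketch; for both options the default 0 means "use the system default".
    # Length of the Netty server's pending-connection queue.
    taskmanager.network.netty.server.backlog: 1024
    # Netty send/receive socket buffer size in bytes (here: 4 MiB).
    taskmanager.network.netty.sendReceiveBufferSize: 4194304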
Ok, I will try.
In reply to this post by nobleyd
Is this standalone on physical machines, or Docker/K8s?

When does this exception occur: periodically, or correlated with changes in CPU, memory, or even network traffic?
@东东 It is a standalone cluster. The exceptions happen at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory and network, but I cannot confirm it because it is not very obvious. I checked a few exceptions whose timestamps matched network spikes, but not all of them did, so it is hard to say whether that is the cause.

Besides, there is one point I am not clear about: there are very few reports of this error on the Internet. Similar reports are about RemoteTransportException, whose message hints that the TaskManager may have been lost. But mine is LocalTransportException, and I am not sure whether these two errors have different meanings in Netty. So far I cannot find much material about either exception online.
Is 10.35.215.18 the host machine's IP?

Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps. If that does not help, fall back to tcpdump.
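(On a typical Linux host these can be read with, for example:)

    # Print the current values; note the tcp_tw_recycle key only exists on older kernels.
    sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_timestamps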
Yes, it is the host machine's IP.

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1
Set one of them to 0.
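(As an illustration only - which knob to change should follow the advice in this thread - turning one of them off at runtime could look like:)

    # Disable TCP timestamps immediately (not persistent across reboots).
    sysctl -w net.ipv4.tcp_timestamps=0
    # To persist it, add "net.ipv4.tcp_timestamps = 0" to /etc/sysctl.conf and run: sysctl -p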
What is the reasoning behind this? I cannot make this change directly; it needs to go through an approval process.
If both of these are enabled, connection requests from the same source IP must carry monotonically increasing timestamps; otherwise the (non-increasing) connection requests are treated as invalid and the packets are dropped, which on the client side looks like occasional connection timeouts.

Normally a single machine does not hit this problem, because there is only one clock; it tends to show up behind NAT (since the clocks of multiple hosts are usually not exactly in sync). I do not know your exact architecture, so all I can say is give it a try.

Finally, you can discuss this with your ops team: unless you are sure that no connections come in through NAT, it is best not to have both of these enabled.

PS: kernel 4.1 already removed tcp_tw_reuse, because too many people fell into this trap.
I thought it over carefully. My cluster runs in containers on internal servers, and traffic between the containers should not go through NAT.

That said, judging from the network-related monitoring, many machines do have quite a lot of connections in TIME_WAIT state, around 50,000+, but I do not feel that alone would cause this problem.
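(For reference, such a TIME_WAIT count can be sampled on Linux with something like:)

    # Count sockets currently in TIME_WAIT; the first output line is a header.
    ss -tan state time-wait | wc -l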