flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
The attachment is the exception stack from Flink's web UI. Has anyone
else run into this problem?

Flink 1.12 to 1.13.1, standalone cluster with 30 containers, each with
28 GB of memory.

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Robert Metzger
Hi Yidan,
it seems that the attachment did not make it through the mailing list. Can
you copy-paste the text of the exception here or upload the log somewhere?



On Wed, Jun 16, 2021 at 9:36 AM yidan zhao <[hidden email]> wrote:

> The attachment is the exception stack from Flink's web UI. Has anyone
> else run into this problem?
>
> Flink 1.12 to 1.13.1, standalone cluster with 30 containers, each with
> 28 GB of memory.
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
Hi, here is the exception stack trace as text:

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
readAddress(..) failed: Connection timed out (connection to
'10.35.215.18/10.35.215.18:2045')
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection timed out
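
When many such failures accumulate, it can help to tally them by remote endpoint to see whether one host dominates. The following is a minimal sketch, assuming the log lines follow the message format shown in the trace above; the function names and the regular expression are illustrative and not part of Flink.

```python
import re
from collections import Counter

# Matches the message format seen in the trace above, including the case
# where the mail client wrapped the message across several lines.
PATTERN = re.compile(
    r"(?P<kind>Local|Remote)TransportException:\s*(?P<reason>.*?)\s*"
    r"\(connection\s+to\s+'(?P<host>[^/']*)/(?P<ip>[^:']+):(?P<port>\d+)'\)",
    re.DOTALL,
)


def parse_transport_errors(log_text):
    """Return (kind, reason, ip, port) tuples for each matching exception."""
    results = []
    for match in PATTERN.finditer(log_text):
        # Collapse line wraps and repeated whitespace inside the reason.
        reason = " ".join(match.group("reason").split())
        results.append(
            (match.group("kind"), reason, match.group("ip"), int(match.group("port")))
        )
    return results


def count_by_remote_ip(log_text):
    """Count matching exceptions per remote IP address."""
    return Counter(ip for _, _, ip, _ in parse_transport_errors(log_text))


# The first lines of the trace posted in this message.
SAMPLE = """org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
readAddress(..) failed: Connection timed out (connection to
'10.35.215.18/10.35.215.18:2045')"""

print(parse_transport_errors(SAMPLE))
```

Running the counter over the TaskManager logs of all nodes would show whether the timeouts cluster on a few hosts or are spread evenly.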

Robert Metzger <[hidden email]> wrote on Wed, Jun 16, 2021 at 4:26 PM:

>
> Hi Yidan,
> it seems that the attachment did not make it through the mailing list. Can
> you copy-paste the text of the exception here or upload the log somewhere?
>
>
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Yingjie Cao
In reply to this post by nobleyd
Hi yidan,

1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle; see [1] for more
information.
4. You may try configuring these two options: taskmanager.network.retries
and taskmanager.network.netty.client.connectTimeoutSec. More relevant options
can be found in the 'Data Transport Network Stack' section of [2].
5. If none of the above applies, it may be related to [3]; you may need to
check the number of TCP connections per TM and per node.

Hope this helps.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
[3] https://issues.apache.org/jira/browse/FLINK-22643

Best,
Yingjie
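
As an illustration of point 4, the two options could be written into flink-conf.yaml as plain "key: value" lines. The sketch below only renders such lines; the option names come from the message above, while the numeric values and the helper name render_flink_conf are assumptions chosen for the example, not recommendations.

```python
def render_flink_conf(overrides):
    """Render a dict of option -> value as flink-conf.yaml 'key: value' lines."""
    lines = [f"{key}: {overrides[key]}" for key in sorted(overrides)]
    return "\n".join(lines) + "\n"


# Illustrative values only; neither number comes from the thread.
NETWORK_OVERRIDES = {
    "taskmanager.network.retries": 3,
    "taskmanager.network.netty.client.connectTimeoutSec": 300,
}

print(render_flink_conf(NETWORK_OVERRIDES), end="")
```

Whether either option helps here is an open question: the reported failure is a timeout on an established connection's read path, so the connect timeout may not be the limiting factor.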

yidan zhao <[hidden email]> wrote on Wed, Jun 16, 2021 at 3:36 PM:

> The attachment is the exception stack from Flink's web UI. Has anyone
> else run into this problem?
>
> Flink 1.12 to 1.13.1, standalone cluster with 30 containers, each with
> 28 GB of memory.
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
2: I use G1 and no full GC has occurred; young GC count: 422, time:
142892, so it does not look bad.
3: It is a streaming job.
4: I will try configuring taskmanager.network.retries, whose default is
0, and taskmanager.network.netty.client.connectTimeoutSec, whose default
is 120 s.
5: I checked the number of network file descriptors of the TaskManager;
it is about 1000+, so I think it is a reasonable value.

1: I cannot be sure.
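
The two numbers in points 2 and 5 can be sanity-checked with a few lines of code. This is a rough sketch under two assumptions that the message does not state: that the GC time is reported in milliseconds, and that the TaskManager runs on Linux so its descriptors are visible under /proc. The function names are illustrative.

```python
import os


def average_pause_ms(total_time_ms, count):
    """Average time per collection, assuming total_time_ms is in milliseconds."""
    if count <= 0:
        raise ValueError("count must be positive")
    return total_time_ms / count


def count_socket_links(link_targets):
    """Count fd link targets that refer to sockets, e.g. 'socket:[12345]'."""
    return sum(1 for target in link_targets if target.startswith("socket:"))


def socket_fd_count(pid):
    """Linux only: count the open socket descriptors of the given process."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir and readlink.
            continue
    return count_socket_links(targets)


# Figures quoted in the message: 422 young collections, 142892 total time.
print(round(average_pause_ms(142892, 422), 1))
```

If the time unit really is milliseconds, the average comes out at a few hundred milliseconds per young collection, which may be worth a second look even without any full GC.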

Yingjie Cao <[hidden email]> wrote on Wed, Jun 16, 2021 at 4:34 PM:

>
> Hi yidan,
>
> 1. Is the network stable?
> 2. Is there any GC problem?
> 3. Is it a batch job? If so, please use sort-shuffle; see [1] for more information.
> 4. You may try configuring these two options: taskmanager.network.retries and taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].
> 5. If none of the above applies, it may be related to [3]; you may need to check the number of TCP connections per TM and per node.
>
> Hope this helps.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> [3] https://issues.apache.org/jira/browse/FLINK-22643
>
> Best,
> Yingjie
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
Hi Yingjie,
If the network is not stable, which configuration parameters should I adjust?

yidan zhao <[hidden email]> wrote on Wed, Jun 16, 2021 at 6:56 PM:

>
> 2: I use G1 and no full GC has occurred; young GC count: 422, time:
> 142892, so it does not look bad.
> 3: It is a streaming job.
> 4: I will try configuring taskmanager.network.retries, whose default is
> 0, and taskmanager.network.netty.client.connectTimeoutSec, whose default
> is 120 s.
> 5: I checked the number of network file descriptors of the TaskManager;
> it is about 1000+, so I think it is a reasonable value.
>
> 1: I cannot be sure.
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
I also searched the internet and found many results. There are some related
exceptions such as org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException,
but in my case it is
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException.
I am not sure whether 'LocalTransportException' and
'RemoteTransportException' mean different things.

yidan zhao <[hidden email]> wrote on Wed, Jun 16, 2021 at 7:10 PM:

>
> Hi Yingjie,
> If the network is not stable, which configuration parameters should I adjust?
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Yingjie Cao
In reply to this post by nobleyd
Maybe you can try increasing taskmanager.network.retries,
taskmanager.network.netty.server.backlog and
taskmanager.network.netty.sendReceiveBufferSize. These options have been
useful for our jobs.
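
Before raising these three options it may be worth checking what the cluster currently sets. Below is a minimal sketch that reads flat "key: value" lines from the text of a flink-conf.yaml and reports the three options named above. The parser is a simplification and not Flink's own loader, and the sample configuration text is hypothetical.

```python
WATCHED_OPTIONS = (
    "taskmanager.network.retries",
    "taskmanager.network.netty.server.backlog",
    "taskmanager.network.netty.sendReceiveBufferSize",
)


def parse_flat_conf(text):
    """Parse flat 'key: value' lines, ignoring blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def report(text):
    """Map each watched option to its configured value, or None if unset."""
    conf = parse_flat_conf(text)
    return {name: conf.get(name) for name in WATCHED_OPTIONS}


# Hypothetical excerpt of a flink-conf.yaml, for illustration only.
SAMPLE_CONF = """
# network tuning
taskmanager.network.retries: 3
taskmanager.memory.process.size: 28g
"""

for name, value in report(SAMPLE_CONF).items():
    print(f"{name} = {value if value is not None else '<unset, default applies>'}")
```

An option reported as unset falls back to whatever default the running Flink version defines, so the configuration reference for that version is the place to confirm the effective value.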

yidan zhao <[hidden email]> wrote on Wed, Jun 16, 2021 at 7:10 PM:

> Hi Yingjie,
> If the network is not stable, which configuration parameters should I adjust?
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
Ok, I will try.

Yingjie Cao <[hidden email]> wrote on Wed, Jun 16, 2021 at 8:00 PM:

>
> Maybe you can try increasing taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options have been useful for our jobs.
>

Re:Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

东东
In reply to this post by nobleyd
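Single-machine standalone, or Docker/K8s?

Regarding when this exception occurs: is it periodic, or does it correlate with changes in CPU, memory, or even network traffic?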



At 2021-06-16 19:10:24, "yidan zhao" <[hidden email]> wrote:

>Hi Yingjie,
>If the network is not stable, which configuration parameters should I adjust?
>

Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
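@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I cannot confirm it because it is not very pronounced.
I checked several of the exceptions: their timestamps matched network spikes, but not all of them did, so it is hard to say whether that is the cause.

Also, one thing is unclear to me: there are very few reports of this error online. The similar ones are all
RemoteTransportException, with a message saying that the TaskManager may have been lost or the like. Mine is
LocalTransportException, and I am not sure whether these two errors mean different things in netty. So far I cannot find much information online about either exception.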

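东东 <[hidden email]> wrote on Thu, Jun 17, 2021 at 11:19 AM:

>
> Single-machine standalone, or Docker/K8s?
>
>
>
> Regarding when this exception occurs: is it periodic, or does it correlate with changes in CPU, memory, or even network traffic?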
>
>
>

Re:Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

东东
Is 10.35.215.18 the host machine's IP?

Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
If all else fails, use tcpdump.
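
The two kernel settings mentioned above can be read without extra tooling on Linux, where each sysctl key maps to a file under /proc/sys. A small sketch follows, assuming a Linux host; the helper names are illustrative, and a key that the running kernel does not provide is reported as unavailable rather than raising an error.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as net.ipv4.tcp_timestamps to its file."""
    return Path(root) / name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the setting as a string, or None if the kernel does not expose it."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


for key in ("net.ipv4.tcp_tw_recycle", "net.ipv4.tcp_timestamps"):
    value = read_sysctl(key)
    print(f"{key} = {value if value is not None else '<not available>'}")
```

The same values are what `sysctl <key>` prints on the command line; the script form is convenient when the check has to run on every node of the cluster.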



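At 2021-06-17 12:41:58, "yidan zhao" <[hidden email]> wrote:

>@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I cannot confirm it because it is not very pronounced.
>I checked several of the exceptions: their timestamps matched network spikes, but not all of them did, so it is hard to say whether that is the cause.
>
>Also, one thing is unclear to me: there are very few reports of this error online. The similar ones are all
>RemoteTransportException, with a message saying that the TaskManager may have been lost or the like. Mine is
>LocalTransportException, and I am not sure whether these two errors mean different things in netty. So far I cannot find much information online about either exception.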
>

Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
Yes, it is the host machine's IP.

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1

东东 <[hidden email]> wrote on Thu, Jun 17, 2021 at 12:52 PM:

>
> Is 10.35.215.18 the host machine's IP?
>
> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
> If all else fails, use tcpdump.
>
>
>

Re:Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

东东


Set one of them to 0.


At 2021-06-17 13:11:01, "yidan zhao" <[hidden email]> wrote:

>Yes, it is the host machine's IP.
>
>net.ipv4.tcp_tw_reuse = 1
>net.ipv4.tcp_timestamps = 1
>

Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
What is the reasoning behind this? I cannot make this change directly; I need to file a request for it.

东东 <[hidden email]> wrote on Thu, Jun 17, 2021 at 1:36 PM:

>
>
>
> Set one of them to 0.
>
>

Re:Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

东东
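If both are enabled, the timestamps in connection requests from the same source IP must be increasing; otherwise (non-increasing) the connection request is treated as invalid and the packet is dropped, which the client side experiences as occasional connection timeouts.



Generally a single machine does not have this problem, because there should be only one clock. It tends to show up behind NAT (because the clocks of multiple hosts are usually not exactly in sync). But I do not know your exact architecture, so all I can say is give it a try.


Finally, you could discuss this with your ops team: unless you are sure that no connections arrive through NAT, it is best not to enable both.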


PS: tcp_tw_recycle was removed in kernel 4.12, because too many people fell into this trap.
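
For anyone who wants to check these settings on a host, here is a minimal sketch in Python 3, assuming Linux with the usual /proc/sys/net/ipv4 layout. The helper names `read_tcp_settings` and `risky_combination` are made up for illustration, and the "risky" pair is assumed to be tcp_tw_recycle together with tcp_timestamps, as asked about earlier in this thread.

```python
from pathlib import Path

# Settings discussed in this thread. tcp_tw_recycle is absent on newer
# kernels (it was removed in Linux 4.12), so a missing file is reported
# as None instead of raising an error.
SETTINGS = ("tcp_timestamps", "tcp_tw_reuse", "tcp_tw_recycle")


def read_tcp_settings(base="/proc/sys/net/ipv4"):
    """Return {setting: int or None} for each name in SETTINGS."""
    values = {}
    for name in SETTINGS:
        path = Path(base) / name
        try:
            values[name] = int(path.read_text().split()[0])
        except (FileNotFoundError, PermissionError):
            values[name] = None
    return values


def risky_combination(values):
    """True when both tcp_tw_recycle and tcp_timestamps are enabled."""
    return bool(values.get("tcp_tw_recycle")) and bool(values.get("tcp_timestamps"))


if __name__ == "__main__":
    current = read_tcp_settings()
    for name, value in current.items():
        print(f"net.ipv4.{name} = {value}")
    print("recycle + timestamps both on:", risky_combination(current))
```

Inside a container the values may differ from those on the host, so it is probably worth running this in both places.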


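At 2021-06-17 14:07:50, "yidan zhao" <[hidden email]> wrote:

>What is the reasoning behind this? I cannot make this change directly; I would have to file a request.
>
>东东 <[hidden email]> wrote on Thu, Jun 17, 2021 at 1:36 PM: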
>>
>>
>>
>> Set one of them to 0.
>>
>>
>> At 2021-06-17 13:11:01, "yidan zhao" <[hidden email]> wrote:
>> >Yes, it is the host machine's IP.
>> >
>> >net.ipv4.tcp_tw_reuse = 1
>> >net.ipv4.tcp_timestamps = 1
>> >
>> >东东 <[hidden email]> wrote on Thu, Jun 17, 2021 at 12:52 PM:
>> >>
>> >> Is 10.35.215.18 the host machine's IP?
>> >>
>> >> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
>> >> If all else fails, use tcpdump.
>> >>
>> >>
>> >>
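>> >> At 2021-06-17 12:41:58, "yidan zhao" <[hidden email]> wrote:
>> >> >@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I cannot confirm it because it is not very pronounced.
>> >> >I looked into several of the exceptions: the timing matched network spikes for some but not all of them, so it is hard to say whether that is the cause.
>> >> >
>> >> >Also, there is one thing I am not clear about. This error is rarely reported online; the similar reports are all
>> >> >RemoteTransportException, with a message saying the taskmanager may have been lost or the like. Mine is
>> >> >LocalTransportException, and I am not sure whether the two errors mean different things in netty. So far I cannot find much online about either exception.
>> >> >
>> >> >东东 <[hidden email]> wrote on Thu, Jun 17, 2021 at 11:19 AM:
>> >> >>
>> >> >> Single-machine standalone, or Docker/K8s?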
>> >> >>
>> >> >>
>> >> >>
>> >> >> As for when the exception occurs: is it periodic, or does it correlate with changes in CPU, memory, or even network traffic?
>> >> >>
>> >> >>
>> >> >>
>> >> >> At 2021-06-16 19:10:24, "yidan zhao" <[hidden email]> wrote:
>> >> >> >Hi, Yingjie.
>> >> >> >If the network is not stable, which config parameter should I adjust?
>> >> >> >
>> >> >> >yidan zhao <[hidden email]> wrote on Wed, Jun 16, 2021 at 6:56 PM:
>> >> >> >>
>> >> >> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
>> >> >> >> 142892, so it is not bad.
>> >> >> >> 3: stream job.
>> >> >> >> 4: I will try configuring taskmanager.network.retries, which defaults to
>> >> >> >> 0, and taskmanager.network.netty.client.connectTimeoutSec, which defaults
>> >> >> >> to 120s.
>> >> >> >> 5: I checked the number of network fds of the taskmanager; it is 1000+,
>> >> >> >> so I think it is a reasonable value.
>> >> >> >>
>> >> >> >> 1: I cannot be sure.
>> >> >> >>
>> >> >> >> Yingjie Cao <[hidden email]> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>> >> >> >> >
>> >> >> >> > Hi yidan,
>> >> >> >> >
>> >> >> >> > 1. Is the network stable?
>> >> >> >> > 2. Is there any GC problem?
>> >> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
>> >> >> >> > 4. You may try configuring these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in the 'Data Transport Network Stack' section of [2].
>> >> >> >> > 5. If none of the above applies, it may be related to [3]; you may need to check the number of TCP connections per TM and per node.
>> >> >> >> >
>> >> >> >> > Hope this helps.
>> >> >> >> >
>> >> >> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
>> >> >> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
>> >> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
>> >> >> >> >
>> >> >> >> > Best,
>> >> >> >> > Yingjie
>> >> >> >> >
>> >> >> >> > yidan zhao <[hidden email]> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>> >> >> >> >>
>> >> >> >> >> The attachment is the exception stack from Flink's web UI. Has anyone
>> >> >> >> >> else run into this problem?
>> >> >> >> >>
>> >> >> >> >> Flink 1.12 - Flink 1.13.1. Standalone cluster with 30 containers,
>> >> >> >> >> each with 28G of memory.

Re: Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

nobleyd
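I thought about it carefully: my cluster consists of containers on intranet servers, and traffic between the containers should not count as going through NAT.

That said, the network monitoring does show that many machines have quite a few connections in the TIME_WAIT state, around 50,000+, but I don't feel that should be enough to cause this problem.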
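
To double-check the TIME_WAIT figure without relying on the monitoring system, something like the following might work. It is only a sketch, assuming Linux and Python 3; it parses the IPv4 socket table in /proc/net/tcp, where the fourth column holds the state as a hex code (06 is TIME_WAIT). The function name `count_tcp_states` is made up for illustration.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp (Linux).
# Only a few are named here; any other code is counted under its raw value.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        state = fields[3].upper()
        counts[TCP_STATES.get(state, state)] += 1
    return counts


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        print(count_tcp_states(f.read()).most_common())
```

Note that /proc/net/tcp only covers the current network namespace, so inside a container it shows that container's sockets rather than the whole host's; IPv6 sockets are listed separately in /proc/net/tcp6.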
