Flink 1.11 submit job timed out

Flink 1.11 submit job timed out

Chris Guo

Hi

We are running Flink 1.11, deployed as a Kubernetes session cluster, with 30 TMs and 4 slots per TM; the job parallelism is 120. When the job is submitted, the JobManager prints a flood of "No hostname could be resolved for the IP address" warnings, the JM times out, and the submission fails. The web UI also hangs and stops responding.

Even a WordCount submitted with parallelism 1 prints a few of the "no hostname" lines but then submits normally; once the parallelism goes up, the submission times out.


A partial log excerpt follows:

2020-07-15 16:58:46,460 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.32.160.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-15 16:58:46,460 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.44.224.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-15 16:58:46,461 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.40.32.9, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.

2020-07-15 16:59:10,236 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of JobManager with id 69a0d460de468888a9f41c770d963c0a timed out.
2020-07-15 16:59:10,236 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job e1554c737e37ed79688a15c746b6e9ef from the resource manager.


How should this be dealt with?


Best!


Re: Flink 1.11 submit job timed out

Roc Marshal
Hi, SmileSmile.
I have run into a similar host-resolution problem before; it may be worth troubleshooting from the angle of the k8s pod/node network mapping.
Hope this helps.


Best regards,
Roc Marshal

Re: Flink 1.11 submit job timed out

Chris Guo
Hi Roc

This does not happen with 1.10.1; it only shows up with 1.11. How would you suggest investigating it?





Re: Flink 1.11 submit job timed out

Congxian Qiu
Hi
   If there are no exceptions and GC looks normal, you could take a look at the pod logs; if HA is enabled, also check the zk logs. I once saw a similar symptom on Yarn that turned out to be caused by something else, and the cause was found by reading the NM and zk logs.

Best,
Congxian



Re: Flink 1.11 submit job timed out

Chris Guo
Hi, Congxian

Since this is a test environment, HA is not configured. All I can see so far is the JM printing a flood of "no hostname could be resolved", the JM losing contact, and the job submission failing. Configuring the JM memory to 10g makes no difference (jobmanager.memory.process.size: 10240m).

Rolling the same environment back to 1.10, the problem does not occur and none of the errors above are printed.

Are there any other troubleshooting ideas?

Best!






Re: Flink 1.11 submit job timed out

Congxian Qiu
Hi
   I am not sure whether the complete pod logs are visible in your k8s environment, similar to the NM logs on Yarn. If they are, it is worth going through that pod's full log to see whether anything stands out.
Best,
Congxian



Re: Flink 1.11 submit job timed out

Yang Wang
If your logs keep printing "No hostname could be resolved for the IP address", the cluster's coredns most likely has a problem: the reverse lookup from IP address to hostname is failing. You can start a busybox pod to verify whether that particular IP really cannot be resolved; it may well be a coredns issue.


Best,
Yang


Re: Flink 1.11 submit job timed out

Chris Guo

Hi, Yang Wang!

Great to get your reply; it helps a lot and points me in the right direction. Let me add some information that I hope will help narrow down the root cause.

Where the JM reports the error "No hostname could be resolved for ip address xxxxx", the reported IP is the internal IP that k8s assigned to the Flink pod, not the host machine's IP. Where could the problem lie?

Best!





Re: Flink 1.11 submit job timed out

Yang Wang
What I mean is: while the Flink job is running, use the command below to start a busybox pod inside the cluster, run nslookup {ip_address} in it, and check whether the IP resolves normally. If it does not, the problem is most likely coredns.

kubectl run -i -t busybox --image=busybox --restart=Never

You also need to confirm that the cluster's coredns pods are healthy; they are usually deployed in the kube-system namespace.
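
For reference, a minimal version of this check might look like the following (the IP is one of the TaskManager pod IPs from the warnings above; the coredns label is the common kube-system one and may differ in your distribution):

kubectl run -i -t busybox --image=busybox --restart=Never -- /bin/sh
# inside the busybox shell: try to reverse-resolve one of the TaskManager pod IPs
nslookup 10.32.160.7
# back on the client: check that the coredns pods are up and ready
kubectl -n kube-system get pods -l k8s-app=kube-dns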



Best,
Yang



Re: Flink 1.11 submit job timed out

Chris Guo

Hi Yang Wang,

I just tested this in our test environment: the TaskManager IPs cannot be resolved with nslookup, while the JM IP can. The difference between the two is whether a Service exists for them.

Workaround: I added taskmanager-query-state-service.yaml to the cluster (an optional service according to the official docs [1]). After that, "No hostname could be resolved for ip address" is no longer printed; with the type changed from NodePort to ClusterIP, the job is submitted successfully and the timeout no longer occurs, so the problem is solved for me.
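
For illustration, roughly what such a service might look like, adapted from the optional taskmanager-query-state-service.yaml in [1] with the type switched to ClusterIP as described above; the port and the selector labels are assumptions and have to match the actual TaskManager deployment:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager-query-state
spec:
  type: ClusterIP
  ports:
  - name: query-state
    port: 6125
    targetPort: 6125
  selector:
    app: flink
    component: taskmanager
EOF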


1. Given the behavior above, is this manifest actually required?

2. Among the 1.11 changes I see [FLINK-15911][FLINK-15154], which add support for configuring the locally bound network interface and the externally advertised address/port separately. Is it this change that requires the JM to reverse-resolve a service from the IP reported by the TM?


Best!


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html




Re: Flink 1.11 submit job timed out

Yang Wang
Glad to hear your problem is solved, but I don't think the root cause is adding taskmanager-query-state-service.yaml. On my side the cluster runs fine without creating that service, and nslookup {tm_ip_address} reverse-resolves to a hostname normally.

Note that this is not about resolving a hostname forward, but about verifying the reverse resolution of the IP address.


To answer your two questions:
1. No, it is not required. I verified that the cluster can run jobs normally without creating it, and exposing the rest service as ClusterIP, NodePort or LoadBalancer all work fine.
2. If taskmanager.bind-host is not configured, the two JIRAs [FLINK-15911][FLINK-15154] do not change the address the TM uses when registering with the RM (see the sketch below).
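
A quick way to check the second point, assuming the stock Flink image layout (conf under /opt/flink/conf) and using a placeholder pod name:

# list the TaskManager pods, then inspect the config of one of them (pod name is a placeholder)
kubectl get pods | grep taskmanager
kubectl exec -it flink-taskmanager-xxxx -- grep -E "bind-host|taskmanager.host" /opt/flink/conf/flink-conf.yaml
# no output means neither option is configured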

If you want to find the root cause, you will probably need to provide the complete JM/TM logs so that they can be analyzed.


Best,
Yang


Re: Flink 1.11 submit job timed out

Chris Guo
Hi Yang Wang

First, let me share the versions in my environment:

Kubernetes: 1.17.4, CNI: weave

Points 1 to 3 below are my open questions; point 4 is the JM log.


1. After removing taskmanager-query-state-service.yaml it indeed stops working; the nslookup fails:

kubectl exec -it busybox2 -- /bin/sh
/ # nslookup 10.47.96.2
Server:          10.96.0.10
Address:     10.96.0.10:53

** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN



2. Flink 1.11 vs Flink 1.10

On the detail > subtasks > taskmanagers line in the web UI, 1.11 now shows 172-20-0-50, whereas 1.10 showed flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed here? (This cluster currently runs both 1.10 and 1.11, and 1.10 runs fine; if coredns were broken, shouldn't Flink 1.10 show the same problem?)

3. Does coredns need any special configuration?

Resolving domain names inside the containers works fine; only reverse resolution fails for pods without a Service. Is there anything that needs to be configured in coredns?


4. The JM log at the time of the timeout is as follows:



2020-07-23 13:53:00,228 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0 was granted leadership with fencing token 00000000000000000000000000000000
2020-07-23 13:53:00,232 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_1 .
2020-07-23 13:53:00,233 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2020-07-23 13:53:03,472 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696 (akka.tcp://flink@10.34.128.9:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:03,777 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4 (akka.tcp://flink@10.32.160.6:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:03,787 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4 (akka.tcp://flink@10.39.0.8:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:04,044 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2 (akka.tcp://flink@10.38.64.7:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:04,099 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330 (akka.tcp://flink@10.42.160.6:6122/user/rpc/taskmanager_0) at ResourceManager
2020-07-23 13:53:04,146 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8 (akka.tcp://flink@10.33.192.14:6122/user/rpc/taskmanager_0) at ResourceManager


2020-07-23 13:55:44,220 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,222 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,251 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/rpc/jobmanager_2 .
2020-07-23 13:55:44,260 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Initializing job JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,278 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using restart back off time strategy NoRestartBackoffTimeStrategy for JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Running initialization on master for job JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Successfully ran initialization on master in 0 ms.
2020-07-23 13:55:44,428 INFO  org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] - Built 1 pipelined regions in 25 ms
2020-07-23 13:55:44,437 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Loading state backend via factory org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
2020-07-23 13:55:44,456 INFO  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using predefined options: DEFAULT.
2020-07-23 13:55:44,457 INFO  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using default options factory: DefaultConfigurableOptionsFactory{configuredOptions={}}.
2020-07-23 13:55:44,466 WARN  org.apache.flink.runtime.util.HadoopUtils                    [] - Could not find Hadoop configuration via any of the supported methods (Flink configuration, environment variables).
2020-07-23 13:55:45,276 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533 for JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:45,280 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
2020-07-23 13:55:45,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Starting scheduling with scheduling strategy [org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]



2020-07-23 13:55:45,436 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
2020-07-23 13:55:45,436 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
2020-07-23 13:55:45,436 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
2020-07-23 13:55:45,437 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e559485ea7b0b7e17367816882538d90}]
2020-07-23 13:55:45,437 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
2020-07-23 13:55:45,437 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{582a86197884206652dff3aea2306bb3}]
2020-07-23 13:55:45,437 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
2020-07-23 13:55:45,437 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
2020-07-23 13:55:45,438 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{c35033d598a517acc108424bb9f809fb}]
2020-07-23 13:55:45,438 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
2020-07-23 13:55:45,438 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
2020-07-23 13:55:45,487 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Connecting to ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 13:55:45,492 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Resolved ResourceManager address, beginning registration
2020-07-23 13:55:45,493 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,499 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,501 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-07-23 13:55:45,501 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,502 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id d420d08bf2654d9ea76955c70db18b69.
2020-07-23 13:55:45,502 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and profile ResourceProfile{UNKNOWN} from resource manager.

2020-07-23 13:55:45,514 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id fce526bbe3e1be91caa3e4b536b20e35.
2020-07-23 13:55:45,514 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,514 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,515 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and profile ResourceProfile{UNKNOWN} from resource manager.

2020-07-23 13:55:45,517 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id 18ac7ec802ebfcfed8c05ee9324a55a4.

2020-07-23 13:55:45,518 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id 7ec76cbe689eb418b63599e90ade19be.
2020-07-23 13:55:45,518 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 99a030d0e3f428490a501c0132f27a56 with allocation id b78837a29b4032924ac25be70ed21a3c.


2020-07-23 13:58:18,037 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.47.96.2, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:22,192 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.34.64.14, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:22,358 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.34.128.9, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:24,562 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.32.160.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:25,487 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.38.64.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:27,636 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.42.160.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:27,767 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.43.64.12, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:29,651 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed out.
2020-07-23 13:58:29,651 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.
2020-07-23 13:58:29,854 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.39.0.8, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:33,623 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.35.0.10, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:35,756 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.36.32.8, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:36,694 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.42.128.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.


2020-07-23 14:01:17,814 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed out..
2020-07-23 14:01:17,815 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Connecting to ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 14:01:17,816 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Resolved ResourceManager address, beginning registration
2020-07-23 14:01:17,816 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:17,836 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: host_relation -> Timestamps/Watermarks -> Map (1/1) (302ca9640e2d209a543d843f2996ccd2) switched from SCHEDULED to FAILED on not deployed.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
     ... 25 more
Caused by: java.util.concurrent.TimeoutException
     ... 23 more
2020-07-23 14:01:17,848 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,910 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 902 tasks should be restarted to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,913 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state RUNNING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     ... 45 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
     ... 25 more
Caused by: java.util.concurrent.TimeoutException
     ... 23 more



2020-07-23 14:01:18,109 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 1809eb912d69854f2babedeaf879df6a.
2020-07-23 14:01:18,110 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
     ... 45 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
     ... 25 more
Caused by: java.util.concurrent.TimeoutException
     ... 23 more
2020-07-23 14:01:18,114 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,117 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2020-07-23 14:01:18,118 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 302ca9640e2d209a543d843f2996ccd2.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] timed out.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] timed out.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{e7e422409acebdb385014a9634af6a90}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] timed out.
2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] timed out.
2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [


2020-07-23 14:01:18,151 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 99a030d0e3f428490a501c0132f27a56 reached globally terminal state FAILED.
2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-07-23 14:01:18,225 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job JobTest(99a030d0e3f428490a501c0132f27a56).
2020-07-23 14:01:18,381 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Suspending SlotPool.
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: JobManager is shutting down..
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Stopping SlotPool.
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager [hidden email]://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.



On 07/23/2020 13:26, Yang Wang wrote:
Glad to hear your problem is resolved, but I don't think the root cause is adding taskmanager-query-state-service.yaml.
On my side everything works without creating that service, and nslookup {tm_ip_address} reverse-resolves to the hostname just fine.

Note that this is not about resolving a hostname, but about verifying by reverse-resolving the IP address.


To answer your two questions:
1. It is not mandatory. I verified that the cluster runs jobs fine without creating it, and exposing the REST
service as ClusterIP, NodePort or LoadBalancer all work normally.
2. If taskmanager.bind-host is not configured, the two JIRAs
[Flink-15911][Flink-15154] do not affect the address the TM uses when registering with the RM.

If you want to find the root cause, you will probably need to provide the complete JM/TM logs so that they can be analyzed.


Best,
Yang
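
For reference, a minimal way to pull those complete logs out of a session cluster (the app=flink label and the pod names are assumptions taken from the sample manifests; adjust them to your deployment):

# list the JobManager/TaskManager pods
kubectl get pods -l app=flink
# dump the complete JobManager log (pod name is a placeholder)
kubectl logs flink-jobmanager-xxxx > jm.log
# dump the log of one TaskManager
kubectl logs flink-taskmanager-xxxx > tm.log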

SmileSmile <[hidden email]> wrote on Thu, Jul 23, 2020 at 11:30 AM:

>
> Hi Yang Wang
>
> I just tested this in our test environment: the TaskManager IPs cannot be nslookup'ed, while the JM IP can, and the difference between the two is whether a Service exists for them.
>
> Workaround: I added taskmanager-query-state-service.yaml to the cluster (according to the website it is an optional service). The "No
> hostname could be resolved for ip
> address" messages no longer flood the log, and after changing NodePort to ClusterIP the job can be submitted successfully without the timeout, so the problem is resolved for now.
>
>
> 1. Given the situation above, is this manifest actually mandatory?
>
> 2. Among the 1.11 changes I see [Flink-15911][Flink-15154],
> which support configuring the locally bound network interface separately from the externally advertised address and port. Is it this change
> that requires the JM to reverse-resolve a Service from the IP reported by the TM?
>
>
> Best!
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
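
A sketch of the kind of Service described above; the name, port and selector are copied from the sample manifests in [1] and should be treated as assumptions to adapt to your own deployment:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager-query-state
spec:
  type: ClusterIP            # the docs' sample uses NodePort; ClusterIP is what worked here
  ports:
  - name: query-state
    port: 6125
    targetPort: 6125
  selector:
    app: flink
    component: taskmanager
EOF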
>
>
> On 07/23/2020 10:11, Yang Wang <[hidden email]> wrote:
> What I mean is: while the Flink job is running, use the command below to start a busybox pod inside the cluster,
> run nslookup {ip_address} in it, and check whether the address resolves correctly. If it does not,
> it should be a coredns problem.
>
> kubectl run -i -t busybox --image=busybox --restart=Never
>
> You also need to confirm that the cluster's coredns pods are healthy; they are usually deployed in the kube-system namespace.
>
>
>
> Best,
> Yang
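
A hedged sketch of that check; the k8s-app=kube-dns label and the coredns ConfigMap name below are the kubeadm defaults and may differ in your cluster:

# are the CoreDNS pods healthy, and what do they log?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
# does the Corefile serve the reverse zones (look for "in-addr.arpa")?
kubectl get configmap coredns -n kube-system -o yaml
# then, from the throwaway busybox pod, try a forward and a reverse lookup
kubectl run -i -t busybox --image=busybox --restart=Never -- sh
/ # nslookup flink-jobmanager     # forward lookup of the JM service
/ # nslookup 10.47.96.2           # reverse lookup of a TM pod IP (example address from this thread)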
>
>
> SmileSmile <[hidden email]> wrote on Wed, Jul 22, 2020 at 7:57 PM:
>
> >
> > Hi, Yang Wang!
> >
> > I'm very glad to get your reply; it helps a lot and tells me which direction to look in. Let me add some information that may help narrow down the root cause further.
> >
> > Where the JM reports "No hostname could be resolved for ip address xxxxx",
> > the reported IP is the internal IP that k8s assigned to the Flink pod, not the host machine's IP. Where could this problem come from?
> >
> > Best!
> >
> >
> >
> > On 07/22/2020 18:18, Yang Wang <[hidden email]> wrote:
> > If your log keeps flooding with "No hostname could be resolved for the IP
> > address", the cluster's coredns probably has a problem: the hostname cannot be found by reverse lookup from the IP address.
> > You can start a busybox pod to verify whether that particular IP really cannot be resolved;
> > it may well be a coredns issue.
> >
> >
> > Best,
> > Yang
> >
> > Congxian Qiu <[hidden email]> wrote on Tue, Jul 21, 2020 at 7:29 PM:
> >
> > > Hi
> > >    I'm not sure whether you can see a pod's complete log in the k8s environment, similar to Yarn's NM log. If you can, try checking
> > > whether the complete log of that pod reveals anything.
> > > Best,
> > > Congxian
> > >
> > >
> > > SmileSmile <[hidden email]> wrote on Tue, Jul 21, 2020 at 3:19 PM:
> > >
> > > > Hi, Congxian
> > > >
> > > > Since this is a test environment, HA is not configured. From what I can see so far, the JM floods the log with "no hostname could be
> > > > resolved", the JM loses contact, and the job submission fails.
> > > > Configuring the JM memory to 10g makes no difference (jobmanager.memory.process.size: 10240m).
> > > >
> > > > Rolling the version back to 1.10 in the same environment, the problem does not occur and none of the errors above are printed.
> > > >
> > > >
> > > > Are there any other troubleshooting ideas?
> > > >
> > > > Best!
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 07/16/2020 13:17, Congxian Qiu wrote:
> > > > Hi
> > > >   If there are no exceptions and GC also looks normal, you could check the pod's logs, and if HA is enabled also the zk
> > > > logs. I once ran into a similar symptom in a Yarn
> > > > environment that turned out to be caused by something else, and the cause was found by reading the NM logs and the zk logs.
> > > >
> > > > Best,
> > > > Congxian
> > > >
> > > >
> > > > SmileSmile <[hidden email]> wrote on Wed, Jul 15, 2020 at 5:20 PM:
> > > >
> > > > > Hi Roc
> > > > >
> > > > > This behavior does not exist in 1.10.1; it only appears in 1.11. What would be an appropriate way to investigate it?
> > > > >
> > > > >
> > > > >
> > > > >
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.11 submit job timed out

Yang Wang
Looking at your job, the root cause of the failure is not "No hostname could be resolved";
the reason for that WARNING can be discussed separately (if it really does not show up in 1.10).
If you start a standalone cluster locally you will also see this WARNING, and it does not affect normal operation.


The failure is that the slot request timed out after 5 minutes. In the log you provided, the span from 2020-07-23 13:55:45,519 to 2020-07-23
13:58:18,037 is blank; nothing was omitted there, right?
During that period the tasks should normally have started deploying. The log also shows the JM->RM heartbeat timing out, so even communication within the same process in the same pod timed out,
which is why I suspect the JM was stuck in Full GC the whole time; please confirm this on your side.


Best,
Yang
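
A hedged sketch of that check (the five-minute figure presumably corresponds to the default slot.request.timeout of 300000 ms); the entrypoint class name and the assumption that the image ships JDK tools such as jps/jstat may not hold for every Flink image:

# sample GC activity inside the JobManager pod (pod name is a placeholder)
kubectl exec -it flink-jobmanager-xxxx -- sh -c \
  'jstat -gcutil $(jps | grep StandaloneSessionClusterEntrypoint | cut -d" " -f1) 1000 10'
# or enable GC logging in flink-conf.yaml and redeploy (Java 8 style flags):
#   env.java.opts.jobmanager: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps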

SmileSmile <[hidden email]> wrote on Thu, Jul 23, 2020 at 2:43 PM:

> Hi Yang Wang
>
> Let me first share the versions in my environment:
>
>
> kubernetes: 1.17.4   CNI: weave
>
>
> Items 1, 2 and 3 below are my remaining doubts;
>
> item 4 is the JM log.
>
>
> 1. Without taskmanager-query-state-service.yaml it indeed does not work; nslookup gives:
>
> kubectl exec -it busybox2 -- /bin/sh
> / # nslookup 10.47.96.2
> Server:          10.96.0.10
> Address:     10.96.0.10:53
>
> ** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN
>
>
>
> 2. Flink 1.11 vs Flink 1.10
>
> On the detail subtasks taskmanagers xxx x row,
> 1.11 shows 172-20-0-50 while 1.10 shows flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed here? (This cluster currently runs both 1.10 and 1.11; 1.10 runs fine, and if coredns were broken, the 1.10 Flink should show the same symptom, shouldn't it?)
>
> 3. Does coredns need any special configuration?
>
> Resolving domain names inside the containers works fine; only reverse resolution fails when there is no Service. Is there anything that needs to be configured in coredns?
>
>
> 4. The JM log at the time of the timeout is as follows:
>
>
>
> 2020-07-23 13:53:00,228 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0
> was granted leadership with fencing token 00000000000000000000000000000000
> 2020-07-23 13:53:00,232 INFO
>  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> at akka://flink/user/rpc/dispatcher_1 .
> 2020-07-23 13:53:00,233 INFO
>  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
> Starting the SlotManager.
> 2020-07-23 13:53:03,472 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696
> (akka.tcp://flink@10.34.128.9:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:03,777 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4
> (akka.tcp://flink@10.32.160.6:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:03,787 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4
> (akka.tcp://flink@10.39.0.8:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,044 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2
> (akka.tcp://flink@10.38.64.7:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,099 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330
> (akka.tcp://flink@10.42.160.6:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,146 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8
> (akka.tcp://flink@10.33.192.14:6122/user/rpc/taskmanager_0) at
> ResourceManager
>
>
> 2020-07-23 13:55:44,220 INFO
>  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received
> JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
> 2020-07-23 13:55:44,222 INFO
>  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
> Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
> 2020-07-23 13:55:44,251 INFO
>  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at
> akka://flink/user/rpc/jobmanager_2 .
> 2020-07-23 13:55:44,260 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Initializing job JobTest
> (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,278 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Using restart back off time strategy
> NoRestartBackoffTimeStrategy for JobTest (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,319 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Running initialization on master for job JobTest
> (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,319 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Successfully ran initialization on master in 0 ms.
> 2020-07-23 13:55:44,428 INFO
>  org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] -
> Built 1 pipelined regions in 25 ms
> 2020-07-23 13:55:44,437 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Loading state backend via factory
> org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
> 2020-07-23 13:55:44,456 INFO
>  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> predefined options: DEFAULT.
> 2020-07-23 13:55:44,457 INFO
>  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> default options factory:
> DefaultConfigurableOptionsFactory{configuredOptions={}}.
> 2020-07-23 13:55:44,466 WARN  org.apache.flink.runtime.util.HadoopUtils
>                  [] - Could not find Hadoop configuration via any of the
> supported methods (Flink configuration, environment variables).
> 2020-07-23 13:55:45,276 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Using failover strategy
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533
> for JobTest (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:45,280 INFO
>  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] -
> JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was
> granted leadership with session id 00000000-0000-0000-0000-000000000000 at
> akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
> 2020-07-23 13:55:45,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Starting scheduling with scheduling strategy
> [org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]
>
>
>
> 2020-07-23 13:55:45,436 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
> 2020-07-23 13:55:45,436 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
> 2020-07-23 13:55:45,436 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{e559485ea7b0b7e17367816882538d90}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{582a86197884206652dff3aea2306bb3}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
> 2020-07-23 13:55:45,438 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{c35033d598a517acc108424bb9f809fb}]
> 2020-07-23 13:55:45,438 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
> 2020-07-23 13:55:45,438 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
> 2020-07-23 13:55:45,487 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Connecting to ResourceManager
> akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> 2020-07-23 13:55:45,492 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Resolved ResourceManager address, beginning
> registration
> 2020-07-23 13:55:45,493 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 13:55:45,499 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 13:55:45,501 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - JobManager successfully registered at ResourceManager,
> leader id: 00000000000000000000000000000000.
> 2020-07-23 13:55:45,501 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,502 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> d420d08bf2654d9ea76955c70db18b69.
> 2020-07-23 13:55:45,502 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
>
> 2020-07-23 13:55:45,514 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> fce526bbe3e1be91caa3e4b536b20e35.
> 2020-07-23 13:55:45,514 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,514 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,515 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
>
> 2020-07-23 13:55:45,517 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> 18ac7ec802ebfcfed8c05ee9324a55a4.
>
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> 7ec76cbe689eb418b63599e90ade19be.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,519 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,519 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> b78837a29b4032924ac25be70ed21a3c.
>
>
> 2020-07-23 13:58:18,037 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.47.96.2, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:22,192 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.34.64.14, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:22,358 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.34.128.9, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:24,562 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.32.160.6, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:25,487 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.38.64.7, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:27,636 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.42.160.6, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:27,767 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.43.64.12, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:29,651 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed
> out.
> 2020-07-23 13:58:29,651 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Disconnect job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56 from the resource manager.
> 2020-07-23 13:58:29,854 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.39.0.8, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:33,623 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.35.0.10, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:35,756 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.36.32.8, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:36,694 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.42.128.6, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
>
>
> 2020-07-23 14:01:17,814 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Close ResourceManager connection
> 83b1ff14900abfd54418e7fa3efb3f8a: The heartbeat of JobManager with id
> 456a18b6c404cb11a359718e16de1c6b timed out..
> 2020-07-23 14:01:17,815 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Connecting to ResourceManager
> akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> 2020-07-23 14:01:17,816 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Resolved ResourceManager address, beginning
> registration
> 2020-07-23 14:01:17,816 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:17,836 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> host_relation -> Timestamps/Watermarks -> Map (1/1)
> (302ca9640e2d209a543d843f2996ccd2) switched from SCHEDULED to FAILED on not
> deployed.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate the required slot within slot request timeout. Please
> make sure that the cluster has enough resources.
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by: java.util.concurrent.CompletionException:
> java.util.concurrent.TimeoutException
>      at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> ~[?:1.8.0_242]
>      ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>      ... 23 more
> 2020-07-23 14:01:17,848 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - Calculating tasks to restart to recover the failed task
> cbc357ccb763df2852fee8c4fc7d55f2_0.
> 2020-07-23 14:01:17,910 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - 902 tasks should be restarted to recover the failed task
> cbc357ccb763df2852fee8c4fc7d55f2_0.
> 2020-07-23 14:01:17,913 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
> JobTest (99a030d0e3f428490a501c0132f27a56) switched from state RUNNING to
> FAILING.
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate the required slot within slot request timeout. Please
> make sure that the cluster has enough resources.
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      ... 45 more
> Caused by: java.util.concurrent.CompletionException:
> java.util.concurrent.TimeoutException
>      at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> ~[?:1.8.0_242]
>      ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>      ... 23 more
>
>
>
> 2020-07-23 14:01:18,109 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 1809eb912d69854f2babedeaf879df6a.
> 2020-07-23 14:01:18,110 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
> JobTest (99a030d0e3f428490a501c0132f27a56) switched from state FAILING to
> FAILED.
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate the required slot within slot request timeout. Please
> make sure that the cluster has enough resources.
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      ... 45 more
> Caused by: java.util.concurrent.CompletionException:
> java.util.concurrent.TimeoutException
>      at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> ~[?:1.8.0_242]
>      ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>      ... 23 more
> 2020-07-23 14:01:18,114 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping
> checkpoint coordinator for job 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,117 INFO
>  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore []
> - Shutting down
> 2020-07-23 14:01:18,118 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 302ca9640e2d209a543d843f2996ccd2.
> 2020-07-23 14:01:18,120 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] timed out.
> 2020-07-23 14:01:18,120 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] timed out.
> 2020-07-23 14:01:18,120 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{e7e422409acebdb385014a9634af6a90}] timed out.
> 2020-07-23 14:01:18,121 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] timed out.
> 2020-07-23 14:01:18,121 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] timed out.
> 2020-07-23 14:01:18,121 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] timed out.
> 2020-07-23 14:01:18,122 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] timed out.
> 2020-07-23 14:01:18,122 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [
>
>
> 2020-07-23 14:01:18,151 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO
>  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job
> 99a030d0e3f428490a501c0132f27a56 reached globally terminal state FAILED.
> 2020-07-23 14:01:18,162 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - JobManager successfully registered at ResourceManager,
> leader id: 00000000000000000000000000000000.
> 2020-07-23 14:01:18,225 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Stopping the JobMaster for job
> JobTest(99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 14:01:18,381 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Suspending SlotPool.
> 2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Close ResourceManager connection
> 83b1ff14900abfd54418e7fa3efb3f8a: JobManager is shutting down..
> 2020-07-23 14:01:18,382 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Stopping
> SlotPool.
> 2020-07-23 14:01:18,382 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Disconnect job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56 from the resource manager.
>
>
> On 07/23/2020 13:26, Yang Wang <[hidden email]> wrote:
> Glad to hear your problem is resolved, but I don't think the root cause is adding
> taskmanager-query-state-service.yaml. On my side everything works fine without creating
> that service, and nslookup {tm_ip_address} reverse-resolves to a hostname normally.
>
> Note that this is not about resolving a hostname, but about verifying via a reverse
> lookup of the IP address.
>
>
> To answer your two questions:
> 1. It is not required. I verified that the cluster runs jobs fine without creating it,
> and exposing the REST service as ClusterIP, NodePort or LoadBalancer all work.
> 2. If taskmanager.bind-host is not configured, the two JIRAs [Flink-15911][Flink-15154]
> do not affect the address the TM uses when registering with the RM (these are the
> bind-host / advertised-address options sketched below).
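>
> (For reference, a minimal flink-conf.yaml sketch of those options; the keys and the
> values below are illustrative only and should be checked against the 1.11 configuration
> docs:)
>
> taskmanager.bind-host: 0.0.0.0           # interface the TM binds to locally
> taskmanager.host: 10.32.160.6            # address the TM advertises to the JM/RM
> jobmanager.bind-host: 0.0.0.0
> jobmanager.rpc.address: flink-jobmanager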
>
> If you want to find the root cause, you will probably need to provide the complete
> JM/TM logs so that they can be analyzed.
>
>
> Best,
> Yang
>
> SmileSmile <[hidden email]> wrote on Thu, Jul 23, 2020 at 11:30 AM:
>
> >
> > Hi Yang Wang
> >
> > I just tested this in the test environment: the TaskManager IPs cannot be resolved
> > with nslookup, while the JM's can. The difference between the two is whether a
> > Service exists for them.
> >
> > Workaround: I added taskmanager-query-state-service.yaml to the cluster (an optional
> > service according to the official docs, roughly like the sketch below) and changed
> > NodePort to ClusterIP. The "No hostname could be resolved for ip address" messages
> > stop, the job is submitted successfully and no longer times out, so the problem is
> > resolved for me.
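> >
> > (A rough sketch of that manifest, adapted from the template linked in [1]; the
> > service name, labels and port are assumptions and should be checked against the docs:)
> >
> > apiVersion: v1
> > kind: Service
> > metadata:
> >   name: flink-taskmanager-query-state
> > spec:
> >   type: ClusterIP            # changed from NodePort, as described above
> >   ports:
> >   - name: query-state
> >     port: 6125
> >     targetPort: 6125
> >   selector:
> >     app: flink
> >     component: taskmanager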
> >
> >
> > 1. Given the behaviour above, is this manifest actually required?
> >
> > 2. Among the 1.11 changes I noticed [Flink-15911][Flink-15154], which allow the
> > locally bound network interface to be configured separately from the externally
> > visible address and port. Is it this change that requires the JM to reverse-resolve
> > the IP reported by the TM into a service?
> >
> >
> > Best!
> >
> >
> > [1]
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
> >
> >
> > On 07/23/2020 10:11, Yang Wang <[hidden email]> wrote:
> > What I mean is: while the Flink job is running, use the command below to start a
> > busybox pod inside the cluster, run nslookup {ip_address} in it, and see whether the
> > address resolves normally. If it does not, the problem is most likely CoreDNS.
> >
> > kubectl run -i -t busybox --image=busybox --restart=Never
> >
> > You also need to confirm that the cluster's CoreDNS pods are healthy; they are
> > usually deployed in the kube-system namespace.
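> >
> > (A rough sketch of both checks; the IP and the CoreDNS label are examples and may
> > differ in your cluster:)
> >
> > # reverse-resolve one of the TaskManager pod IPs from inside the cluster
> > kubectl run -i -t busybox --image=busybox --restart=Never -- nslookup 10.32.160.6
> >
> > # confirm the CoreDNS pods in kube-system are running and look at their logs
> > kubectl -n kube-system get pods -l k8s-app=kube-dns
> > kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50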
> >
> >
> >
> > Best,
> > Yang
> >
> >
> > SmileSmile <[hidden email]> wrote on Wed, Jul 22, 2020 at 7:57 PM:
> >
> > >
> > > Hi, Yang Wang!
> > >
> > > I'm really glad to get your reply; it helped a lot and pointed me in the right
> > > direction. Let me add some more information that will hopefully help narrow down
> > > the root cause.
> > >
> > > Where the JM reports "No hostname could be resolved for ip address xxxxx", the
> > > reported IP is the internal pod IP that Kubernetes assigned to the Flink pod, not
> > > the host machine's IP. Where could this problem come from?
> > >
> > > Best!
> > >
> > >
> > >
> > > On 07/22/2020 18:18, Yang Wang <[hidden email]> wrote:
> > > If your logs keep printing "No hostname could be resolved for the IP address",
> > > the cluster's CoreDNS probably has a problem: the hostname cannot be found by a
> > > reverse lookup of the IP address. You can start a busybox pod to verify whether
> > > that IP really cannot be resolved; it may well be a CoreDNS issue.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > Congxian Qiu <[hidden email]> wrote on Tue, Jul 21, 2020 at 7:29 PM:
> > >
> > > > Hi
> > > >    I'm not sure whether the complete pod logs are visible in a k8s environment,
> > > > similar to the NM logs on Yarn. If they are, you could try going through the full
> > > > logs of this pod to see whether anything stands out.
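> > > >
> > > > (For what it's worth, the full pod logs can usually be pulled like this; the pod
> > > > names are placeholders:)
> > > >
> > > > kubectl logs flink-jobmanager-<pod> > jm.log
> > > > kubectl logs flink-taskmanager-<pod> --previous    # previous container, if it restarted
> > > >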
> > > > Best,
> > > > Congxian
> > > >
> > > >
> > > > SmileSmile <[hidden email]> wrote on Tue, Jul 21, 2020 at 3:19 PM:
> > > >
> > > > > Hi, Congxian
> > > > >
> > > > > Since this is a test environment, HA is not configured. What I can see so far
> > > > > is that the JM prints a large number of "no hostname could be resolved"
> > > > > messages, the JM loses contact, and the job submission fails. Setting the JM
> > > > > memory to 10g makes no difference (jobmanager.memory.process.size: 10240m).
> > > > >
> > > > > Rolling back to 1.10 in the same environment, the problem does not occur and
> > > > > none of the errors above are printed.
> > > > >
> > > > >
> > > > > Are there any other ideas for troubleshooting this?
> > > > >
> > > > > Best!
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 07/16/2020 13:17, Congxian Qiu wrote:
> > > > > Hi
> > > > >   If there are no exceptions and GC also looks normal, maybe take a look at
> > > > > the pod logs, and, if HA is enabled, at the ZooKeeper logs as well. I once saw
> > > > > a similar symptom on Yarn that was caused by something else entirely, and the
> > > > > cause was found by reading the NM logs and the zk logs.
> > > > >
> > > > > Best,
> > > > > Congxian
> > > > >
> > > > >
> > > > > SmileSmile <[hidden email]> wrote on Wed, Jul 15, 2020 at 5:20 PM:
> > > > >
> > > > > > Hi Roc
> > > > > >
> > > > > > This symptom does not occur on 1.10.1; it only appears on 1.11. What would
> > > > > > be a good way to investigate it?

Re: Flink 1.11 submit job timed out

Chris Guo
Hi, Yang Wang

Because the log was too long, I trimmed some of the repeated content.
At first I also suspected a JM GC problem, but adjusting the JM memory to 10g made no
difference.

Best




On 07/27/2020 11:36, Yang Wang wrote:
Looking at this job, the root cause of the failure is not "No hostname could be resolved";
the reason for that WARNING can be discussed separately (if it really does not appear in
1.10). You can start a standalone cluster locally and you will see the same WARNING; it
does not affect normal operation.


The failure cause is that the slot request timed out after 5 minutes. In the log you
provided, the span from 2020-07-23 13:55:45,519 to 2020-07-23 13:58:18,037 is blank;
nothing was omitted there, right? The tasks should normally have started deploying during
that window. The log also shows the JM->RM heartbeat timing out, even though both
endpoints live in the same process in the same pod, so I suspect the JM was stuck in full
GC the whole time. Please confirm this on your side.
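
(A quick way to confirm this, assuming the image ships the JDK tools; the pod name and
PID are placeholders, and the GC-logging key should be checked against the 1.11 config
docs:)

# sample heap/GC utilisation of the JobManager JVM inside the pod
kubectl exec -it flink-jobmanager-<pod> -- jps
kubectl exec -it flink-jobmanager-<pod> -- jstat -gcutil <jm-pid> 2000

# or enable GC logging in flink-conf.yaml and resubmit (JDK 8 flags, matching the
# 1.8.0_242 runtime shown in the stack traces above)
env.java.opts.jobmanager: "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

The default heartbeat.timeout is 50 s, so a full-GC pause of roughly that length would be
enough to explain the heartbeat timeouts seen in the log.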


Best,
Yang

SmileSmile <[hidden email]> wrote on Thu, Jul 23, 2020 at 2:43 PM:

> Hi Yang Wang
>
> First, the versions in my environment:
>
>
> Kubernetes: 1.17.4, CNI: weave
>
>
> Points 1 to 3 below are my questions;
>
> point 4 is the JM log.
>
>
> 1. After removing taskmanager-query-state-service.yaml, nslookup indeed fails again:
>
> kubectl exec -it busybox2 -- /bin/sh
> / # nslookup 10.47.96.2
> Server:          10.96.0.10
> Address:     10.96.0.10:53
>
> ** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN
>
>
>
> 2. Flink 1.11 vs Flink 1.10
>
> In the "detail subtasks taskmanagers xxx x" row of the web UI, 1.11 now shows
> 172-20-0-50, while 1.10 showed flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed
> here? (This cluster currently runs both 1.10 and 1.11 and 1.10 works fine; if CoreDNS
> had a problem, shouldn't the 1.10 Flink show the same symptom?)
>
> 3. Does CoreDNS need any special configuration?
>
> Resolving domain names inside the container works fine; only reverse resolution fails,
> and only when there is no Service covering the IP. Is there anything that needs to be
> configured in CoreDNS?
>
>
> 4. The JM log around the time of the timeout is as follows:
>
>
>
> 2020-07-23 13:53:00,228 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0
> was granted leadership with fencing token 00000000000000000000000000000000
> 2020-07-23 13:53:00,232 INFO
>  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> at akka://flink/user/rpc/dispatcher_1 .
> 2020-07-23 13:53:00,233 INFO
>  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
> Starting the SlotManager.
> 2020-07-23 13:53:03,472 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696
> (akka.tcp://flink@10.34.128.9:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:03,777 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4
> (akka.tcp://flink@10.32.160.6:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:03,787 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4
> (akka.tcp://flink@10.39.0.8:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,044 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2
> (akka.tcp://flink@10.38.64.7:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,099 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330
> (akka.tcp://flink@10.42.160.6:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,146 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8
> (akka.tcp://flink@10.33.192.14:6122/user/rpc/taskmanager_0) at
> ResourceManager
>
>
> 2020-07-23 13:55:44,220 INFO
>  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Received
> JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
> 2020-07-23 13:55:44,222 INFO
>  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
> Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
> 2020-07-23 13:55:44,251 INFO
>  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at
> akka://flink/user/rpc/jobmanager_2 .
> 2020-07-23 13:55:44,260 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Initializing job JobTest
> (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,278 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Using restart back off time strategy
> NoRestartBackoffTimeStrategy for JobTest (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,319 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Running initialization on master for job JobTest
> (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,319 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Successfully ran initialization on master in 0 ms.
> 2020-07-23 13:55:44,428 INFO
>  org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] -
> Built 1 pipelined regions in 25 ms
> 2020-07-23 13:55:44,437 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Loading state backend via factory
> org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
> 2020-07-23 13:55:44,456 INFO
>  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> predefined options: DEFAULT.
> 2020-07-23 13:55:44,457 INFO
>  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> default options factory:
> DefaultConfigurableOptionsFactory{configuredOptions={}}.
> 2020-07-23 13:55:44,466 WARN  org.apache.flink.runtime.util.HadoopUtils
>                  [] - Could not find Hadoop configuration via any of the
> supported methods (Flink configuration, environment variables).
> 2020-07-23 13:55:45,276 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Using failover strategy
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533
> for JobTest (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:45,280 INFO
>  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] -
> JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was
> granted leadership with session id 00000000-0000-0000-0000-000000000000 at
> akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
> 2020-07-23 13:55:45,286 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Starting scheduling with scheduling strategy
> [org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]
>
>
>
> 2020-07-23 13:55:45,436 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
> 2020-07-23 13:55:45,436 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
> 2020-07-23 13:55:45,436 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{e559485ea7b0b7e17367816882538d90}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{582a86197884206652dff3aea2306bb3}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
> 2020-07-23 13:55:45,437 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
> 2020-07-23 13:55:45,438 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{c35033d598a517acc108424bb9f809fb}]
> 2020-07-23 13:55:45,438 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
> 2020-07-23 13:55:45,438 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending request
> [SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
> 2020-07-23 13:55:45,487 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Connecting to ResourceManager
> akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> 2020-07-23 13:55:45,492 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Resolved ResourceManager address, beginning
> registration
> 2020-07-23 13:55:45,493 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 13:55:45,499 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 13:55:45,501 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - JobManager successfully registered at ResourceManager,
> leader id: 00000000000000000000000000000000.
> 2020-07-23 13:55:45,501 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,502 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> d420d08bf2654d9ea76955c70db18b69.
> 2020-07-23 13:55:45,502 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
>
> 2020-07-23 13:55:45,514 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> fce526bbe3e1be91caa3e4b536b20e35.
> 2020-07-23 13:55:45,514 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,514 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,515 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
>
> 2020-07-23 13:55:45,517 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> 18ac7ec802ebfcfed8c05ee9324a55a4.
>
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> 7ec76cbe689eb418b63599e90ade19be.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,519 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,519 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> b78837a29b4032924ac25be70ed21a3c.
>
>
> 2020-07-23 13:58:18,037 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.47.96.2, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:22,192 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.34.64.14, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:22,358 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.34.128.9, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:24,562 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.32.160.6, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:25,487 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.38.64.7, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:27,636 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.42.160.6, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:27,767 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.43.64.12, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:29,651 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed
> out.
> 2020-07-23 13:58:29,651 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Disconnect job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56 from the resource manager.
> 2020-07-23 13:58:29,854 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.39.0.8, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:33,623 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.35.0.10, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:35,756 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.36.32.8, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
> 2020-07-23 13:58:36,694 WARN
>  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.42.128.6, using IP address
> as host name. Local input split assignment (such as for HDFS files) may be
> impacted.
>
>
> 2020-07-23 14:01:17,814 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Close ResourceManager connection
> 83b1ff14900abfd54418e7fa3efb3f8a: The heartbeat of JobManager with id
> 456a18b6c404cb11a359718e16de1c6b timed out..
> 2020-07-23 14:01:17,815 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Connecting to ResourceManager
> akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> 2020-07-23 14:01:17,816 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Resolved ResourceManager address, beginning
> registration
> 2020-07-23 14:01:17,816 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:17,836 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
> host_relation -> Timestamps/Watermarks -> Map (1/1)
> (302ca9640e2d209a543d843f2996ccd2) switched from SCHEDULED to FAILED on not
> deployed.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate the required slot within slot request timeout. Please
> make sure that the cluster has enough resources.
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by: java.util.concurrent.CompletionException:
> java.util.concurrent.TimeoutException
>      at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> ~[?:1.8.0_242]
>      ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>      ... 23 more
> 2020-07-23 14:01:17,848 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - Calculating tasks to restart to recover the failed task
> cbc357ccb763df2852fee8c4fc7d55f2_0.
> 2020-07-23 14:01:17,910 INFO
>  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - 902 tasks should be restarted to recover the failed task
> cbc357ccb763df2852fee8c4fc7d55f2_0.
> 2020-07-23 14:01:17,913 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
> JobTest (99a030d0e3f428490a501c0132f27a56) switched from state RUNNING to
> FAILING.
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate the required slot within slot request timeout. Please
> make sure that the cluster has enough resources.
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      ... 45 more
> Caused by: java.util.concurrent.CompletionException:
> java.util.concurrent.TimeoutException
>      at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> ~[?:1.8.0_242]
>      ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>      ... 23 more
>
>
>
> 2020-07-23 14:01:18,109 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 1809eb912d69854f2babedeaf879df6a.
> 2020-07-23 14:01:18,110 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
> JobTest (99a030d0e3f428490a501c0132f27a56) switched from state FAILING to
> FAILED.
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> ~[?:1.8.0_242]
>      at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
>      at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate the required slot within slot request timeout. Please
> make sure that the cluster has enough resources.
>      at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
> ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>      ... 45 more
> Caused by: java.util.concurrent.CompletionException:
> java.util.concurrent.TimeoutException
>      at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> ~[?:1.8.0_242]
>      at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> ~[?:1.8.0_242]
>      ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>      ... 23 more
> 2020-07-23 14:01:18,114 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping
> checkpoint coordinator for job 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,117 INFO
>  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore []
> - Shutting down
> 2020-07-23 14:01:18,118 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Discarding the results produced by task execution
> 302ca9640e2d209a543d843f2996ccd2.
> 2020-07-23 14:01:18,120 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] timed out.
> 2020-07-23 14:01:18,120 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] timed out.
> 2020-07-23 14:01:18,120 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{e7e422409acebdb385014a9634af6a90}] timed out.
> 2020-07-23 14:01:18,121 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] timed out.
> 2020-07-23 14:01:18,121 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] timed out.
> 2020-07-23 14:01:18,121 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] timed out.
> 2020-07-23 14:01:18,122 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] timed out.
> 2020-07-23 14:01:18,122 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending
> slot request [
>
>
> 2020-07-23 14:01:18,151 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO
>  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job
> 99a030d0e3f428490a501c0132f27a56 reached globally terminal state FAILED.
> 2020-07-23 14:01:18,162 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - JobManager successfully registered at ResourceManager,
> leader id: 00000000000000000000000000000000.
> 2020-07-23 14:01:18,225 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Stopping the JobMaster for job
> JobTest(99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 14:01:18,381 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Suspending SlotPool.
> 2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Close ResourceManager connection
> 83b1ff14900abfd54418e7fa3efb3f8a: JobManager is shutting down..
> 2020-07-23 14:01:18,382 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Stopping
> SlotPool.
> 2020-07-23 14:01:18,382 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Disconnect job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56 from the resource manager.
>
>
> On 07/23/2020 13:26, Yang Wang <[hidden email]> wrote:
> Glad to hear your problem is solved, but I don't think the root cause is adding
> taskmanager-query-state-service.yaml. On my side the cluster also works fine
> without creating that service, and nslookup {tm_ip_address} reverse-resolves to
> a hostname correctly.
>
> Note that this is not about resolving a hostname, but about verifying by
> reverse-resolving the IP address.
>
>
> To answer your two questions:
> 1. It is not required. I verified that jobs run normally without creating it,
> whether the rest service is exposed via ClusterIP, NodePort, or LoadBalancer.
> 2. If taskmanager.bind-host is not configured, the two JIRAs
> [Flink-15911][Flink-15154] do not affect the address the TM uses when
> registering with the RM.
>
> If you want to find the root cause, you will probably need to provide the
> complete JM/TM logs so this can be analyzed.
>
>
> Best,
> Yang
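
For reference, the bind-vs-advertised address options mentioned above live in
flink-conf.yaml. A minimal sketch with placeholder values only -- whether any of
these settings is actually needed here was not confirmed in this thread:

# flink-conf.yaml (Flink 1.11), sketch only
jobmanager.rpc.address: flink-jobmanager   # externally advertised JM address
jobmanager.bind-host: 0.0.0.0              # interface the JM process binds to
taskmanager.bind-host: 0.0.0.0             # interface the TM process binds to
taskmanager.host: 10.47.96.2               # advertised TM address; normally auto-detected, placeholder value here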
>
> SmileSmile <[hidden email]> wrote on Thu, Jul 23, 2020 at 11:30 AM:
>
> >
> > Hi Yang Wang
> >
> > I just tested this in the test environment: the TaskManager IPs cannot be
> > resolved with nslookup, while the JM can. The difference between the two is
> > whether a Service exists for them.
> >
> > Workaround: I added taskmanager-query-state-service.yaml to the cluster (the
> > docs list it as an optional service) and changed NodePort to ClusterIP. The
> > "No hostname could be resolved for ip address" warnings stopped, the job
> > submits successfully, and the timeout no longer occurs, so the problem is
> > resolved.
> >
> >
> > 1. Given the behavior above, is this manifest actually required?
> >
> > 2. Among the 1.11 changes I noticed [Flink-15911][Flink-15154], which add
> > support for configuring the locally bound network interface separately from
> > the externally accessible address and port. Is it this change that makes the
> > JM reverse-resolve the IP reported by the TM into a service?
> >
> >
> > Best!
> >
> >
> > [1]
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
> >
> >
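
For reference, a minimal sketch of the per-TaskManager service described above,
with the type changed from NodePort to ClusterIP. The port and labels follow the
optional taskmanager-query-state-service.yaml from the Flink 1.11 Kubernetes
docs; treat the exact values as assumptions:

# taskmanager-query-state-service.yaml, sketch only
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager-query-state
spec:
  type: ClusterIP          # changed from NodePort, as described above
  ports:
  - name: query-state
    port: 6125
    targetPort: 6125
  selector:
    app: flink
    component: taskmanager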
> > On 07/23/2020 10:11, Yang Wang <[hidden email]> wrote:
> > What I mean is: while the Flink job is running, use the command below to
> > start a busybox pod in the cluster, then run nslookup {ip_address} inside it
> > and see whether it resolves correctly. If it does not, the problem is most
> > likely coredns.
> >
> > kubectl run -i -t busybox --image=busybox --restart=Never
> >
> > You also need to confirm that the cluster's coredns pods are healthy; they
> > are usually deployed in the kube-system namespace.
> >
> >
> >
> > Best,
> > Yang
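
Two quick checks along those lines -- the k8s-app=kube-dns label and the
"coredns" ConfigMap name are the upstream defaults and may differ in a
customized cluster:

kubectl -n kube-system get pods -l k8s-app=kube-dns      # are the coredns pods Running and Ready?
kubectl -n kube-system get configmap coredns -o yaml     # Corefile; the kubernetes plugin's in-addr.arpa zone handles reverse lookups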
> >
> >
> > SmileSmile <[hidden email]> wrote on Wed, Jul 22, 2020 at 7:57 PM:
> >
> > >
> > > Hi, Yang Wang!
> > >
> > > I'm glad to get your reply; it helped a lot and pointed me in the right
> > > direction. Let me add some more information, hoping it helps narrow down
> > > the root cause.
> > >
> > > Where the JM logs the error, "No hostname could be resolved for ip address
> > > xxxxx", the reported IP is the internal IP that k8s assigned to the flink
> > > pod, not the host machine's IP. Where could this problem come from?
> > >
> > > Best!
> > >
> > >
> > >
> > > On 07/22/2020 18:18, Yang Wang <[hidden email]> wrote:
> > > If your log keeps printing "No hostname could be resolved for the IP
> > > address", the cluster's coredns probably has a problem: the hostname cannot
> > > be found by reverse-looking-up the IP address. You can start a busybox pod
> > > to verify whether that particular IP simply cannot be resolved; it may well
> > > be a coredns issue.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > Congxian Qiu <[hidden email]> wrote on Tue, Jul 21, 2020 at 7:29 PM:
> > > >
> > > > Hi
> > > >    I'm not sure whether you can see a pod's complete log in the k8s
> > > > environment, similar to Yarn's NM logs? If you can, try going through
> > > > the complete log of that pod and see whether anything stands out.
> > > > Best,
> > > > Congxian
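
In a Kubernetes session cluster that usually amounts to something like the
following; the pod names are placeholders:

kubectl get pods                                   # find the JM/TM pod names
kubectl logs flink-jobmanager-<hash>               # full JobManager log
kubectl logs --previous flink-taskmanager-<hash>   # previous container's log if the pod restarted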
> > > >
> > > >
> > > > SmileSmile <[hidden email]> wrote on Tue, Jul 21, 2020 at 3:19 PM:
> > > >
> > > > > Hi, Congxian
> > > > >
> > > > > Since this is a test environment, HA is not configured. What I can see
> > > > > so far is that the JM prints a large number of "no hostname could be
> > > > > resolved" lines, the JM loses its connection, and the job submission
> > > > > fails. Setting the JM memory to 10g makes no difference
> > > > > (jobmanager.memory.process.size: 10240m).
> > > > >
> > > > > Rolling back to 1.10 in the same environment, the problem does not
> > > > > occur and none of the above errors are printed.
> > > > >
> > > > >
> > > > > Are there any other troubleshooting ideas?
> > > > >
> > > > > Best!
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 07/16/2020 13:17, Congxian Qiu wrote:
> > > > > Hi
> > > > >   If there are no exceptions and GC also looks normal, you could take a
> > > > > look at the pod's logs, and at the zk logs if HA is enabled. I once saw
> > > > > a similar symptom on Yarn that turned out to have a different cause,
> > > > > and it was found by reading the NM logs and the zk logs.
> > > > >
> > > > > Best,
> > > > > Congxian
> > > > >
> > > > >
> > > > > SmileSmile <[hidden email]> wrote on Wed, Jul 15, 2020 at 5:20 PM:
> > > > > >
> > > > > > Hi Roc
> > > > > >
> > > > > > This symptom does not exist in 1.10.1 and only appears in 1.11. What
> > > > > > would be a good way to investigate it?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 07/15/2020 17:16, Roc Marshal wrote:
> > > > > > Hi,SmileSmile.
> > > > > > 个人之前有遇到过 类似 的host解析问题,可以从k8s的pod节点网络映射角度排查一下。
> > > > > > 希望这对你有帮助。
> > > > > >
> > > > > >
> > > > > > 祝好。
> > > > > > Roc Marshal
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > 在 2020-07-15 17:04:18,"SmileSmile" <[hidden email]> 写道:
> > > > > > >
> > > > > > >Hi
> > > > > > >
> > > > > > >使用版本Flink 1.11,部署方式 kubernetes session。 TM个数30个,每个TM 4个slot。
> job
> > > > > > 并行度120.提交作业的时候出现大量的No hostname could be resolved for the IP
> > > address,JM
> > > > > time
> > > > > > out,作业提交失败。web ui也会卡主无响应。
> > > > > > >
> > > > > > >用wordCount,并行度只有1提交也会刷,no
> hostname的日志会刷个几条,然后正常提交,如果并行度一上去,就会超时。
> > > > > > >
> > > > > > >
> > > > > > >部分日志如下:
> > > > > > >
> > > > > > >2020-07-15 16:58:46,460 WARN
> > > > > > org.apache.flink.runtime.taskmanager.TaskManagerLocation     []
> -
> > No
> > > > > > hostname could be resolved for the IP address 10.32.160.7, using
> > IP
> > > > > address
> > > > > > as host name. Local input split assignment (such as for HDFS
> > files)
> > > may
> > > > > be
> > > > > > impacted.
> > > > > > >2020-07-15 16:58:46,460 WARN
> > > > > > org.apache.flink.runtime.taskmanager.TaskManagerLocation     []
> -
> > No
> > > > > > hostname could be resolved for the IP address 10.44.224.7, using
> > IP
> > > > > address
> > > > > > as host name. Local input split assignment (such as for HDFS
> > files)
> > > may
> > > > > be
> > > > > > impacted.
> > > > > > >2020-07-15 16:58:46,461 WARN
> > > > > > org.apache.flink.runtime.taskmanager.TaskManagerLocation     []
> -
> > No
> > > > > > hostname could be resolved for the IP address 10.40.32.9, using
> IP
> > > > > address
> > > > > > as host name. Local input split assignment (such as for HDFS
> > files)
> > > may
> > > > > be
> > > > > > impacted.
> > > > > > >
> > > > > > >2020-07-15 16:59:10,236 INFO
> > > > > >
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
> > > [] -
> > > > > The
> > > > > > heartbeat of JobManager with id 69a0d460de468888a9f41c770d963c0a
> > > timed
> > > > > out.
> > > > > > >2020-07-15 16:59:10,236 INFO
> > > > > >
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
> > > [] -
> > > > > > Disconnect job manager 00000000000000000000000000000000
> > > > > > @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2
> for
> > > job
> > > > > > e1554c737e37ed79688a15c746b6e9ef from the resource manager.
> > > > > > >
> > > > > > >
> > > > > > >how to deal with ?
> > > > > > >
> > > > > > >
> > > > > > >beset !
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
> >
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.11 submit job timed out

Yang Wang
I suggest first setting heartbeat.timeout to a larger value and writing out the GC
log, then checking whether full GC happens frequently and how long each pause
lasts. From the log you have provided so far, even the in-process JM->RM heartbeat
times out, so I still suspect this is GC related.

env.java.opts.jobmanager: -Xloggc:<LOG_DIR>/jobmanager-gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=512M


Best,
Yang
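
A minimal flink-conf.yaml sketch combining the two suggestions above. The timeout
value and the extra -XX:+PrintGCDetails/-XX:+PrintGCDateStamps flags are
illustrative additions, not settings confirmed in this thread; Flink's default
heartbeat.timeout is 50000 ms:

# flink-conf.yaml, sketch only
heartbeat.timeout: 120000
env.java.opts.jobmanager: -Xloggc:<LOG_DIR>/jobmanager-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=512M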

SmileSmile <[hidden email]> wrote on Mon, Jul 27, 2020 at 1:50 PM:

> Hi,Yang Wang
>
> The log was too long, so I removed some of the repeated content.
> I did suspect a JM GC problem at first, but setting the JM memory to 10g made
> no difference.
>
> Best
>
>
>
>
> On 07/27/2020 11:36, Yang Wang wrote:
> Looking at this job, the root cause of the failure is not "No hostname could be
> resolved"; the reason for that WARNING can be discussed separately (if it really
> does not occur on 1.10). If you start a standalone cluster locally you will see
> the same WARNING, and it does not affect normal operation.
>
>
> The failure is that the slot requests timed out after 5 minutes. In the log you
> provided, the span from 2020-07-23 13:55:45,519 to 2020-07-23 13:58:18,037 is
> blank -- nothing was omitted there, right? That is exactly when the tasks should
> have started deploying. The log also shows the JM->RM heartbeat timing out, which
> is communication within the same process in the same Pod, so I suspect the JM was
> stuck in full GC the whole time; please confirm that on your side.
>
>
> Best,
> Yang
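
The 5-minute figure corresponds to Flink's slot.request.timeout, which defaults
to 300000 ms. If GC turns out not to be the cause, raising it is one way to give
slow TM registration more time -- an illustrative option, not advice given in
this thread:

# flink-conf.yaml, sketch only
slot.request.timeout: 600000   # slot allocation timeout in ms (default 300000)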
>
> SmileSmile <[hidden email]> wrote on Thu, Jul 23, 2020 at 2:43 PM:
>
> > Hi Yang Wang
> >
> > First, let me share the versions in my environment:
> >
> >
> > kubernetes:1.17.4.   CNI: weave
> >
> >
> > Points 1, 2 and 3 below are my questions;
> >
> > point 4 is the JM log.
> >
> >
> > 1. After removing taskmanager-query-state-service.yaml, the reverse lookup
> > indeed fails. nslookup:
> >
> > kubectl exec -it busybox2 -- /bin/sh
> > / # nslookup 10.47.96.2
> > Server:          10.96.0.10
> > Address:     10.96.0.10:53
> >
> > ** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN
> >
> >
> >
> > 2. Flink 1.11 vs. Flink 1.10:
> >
> > on the detail subtasks taskmanagers xxx x line of the web UI,
> >
> > 1.11 now shows 172-20-0-50, while 1.10 showed
> > flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed here? (This cluster
> > runs both 1.10 and 1.11; 1.10 runs fine, and if coredns had a problem, the
> > 1.10 Flink should show the same symptom, right?)
> >
> > 3. Does coredns need any special configuration?
> >
> > Resolving domain names inside a container works fine; only reverse resolution
> > fails when there is no service. Is there anything that needs to be configured
> > in coredns?
> >
> >
> > 4. The JM log at the time of the timeout is as follows:
> >
> >
> >
> > 2020-07-23 13:53:00,228 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > ResourceManager akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_0
> > was granted leadership with fencing token
> 00000000000000000000000000000000
> > 2020-07-23 13:53:00,232 INFO
> >  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
> Starting
> > RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> > at akka://flink/user/rpc/dispatcher_1 .
> > 2020-07-23 13:53:00,233 INFO
> >  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl []
> -
> > Starting the SlotManager.
> > 2020-07-23 13:53:03,472 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696
> > (akka.tcp://flink@10.34.128.9:6122/user/rpc/taskmanager_0) at
> > ResourceManager
> > 2020-07-23 13:53:03,777 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4
> > (akka.tcp://flink@10.32.160.6:6122/user/rpc/taskmanager_0) at
> > ResourceManager
> > 2020-07-23 13:53:03,787 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4
> > (akka.tcp://flink@10.39.0.8:6122/user/rpc/taskmanager_0) at
> > ResourceManager
> > 2020-07-23 13:53:04,044 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2
> > (akka.tcp://flink@10.38.64.7:6122/user/rpc/taskmanager_0) at
> > ResourceManager
> > 2020-07-23 13:53:04,099 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330
> > (akka.tcp://flink@10.42.160.6:6122/user/rpc/taskmanager_0) at
> > ResourceManager
> > 2020-07-23 13:53:04,146 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8
> > (akka.tcp://flink@10.33.192.14:6122/user/rpc/taskmanager_0) at
> > ResourceManager
> >
> >
> > 2020-07-23 13:55:44,220 INFO
> >  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
> Received
> > JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
> > 2020-07-23 13:55:44,222 INFO
> >  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
> > Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
> > 2020-07-23 13:55:44,251 INFO
> >  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
> Starting
> > RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at
> > akka://flink/user/rpc/jobmanager_2 .
> > 2020-07-23 13:55:44,260 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Initializing job JobTest
> > (99a030d0e3f428490a501c0132f27a56).
> > 2020-07-23 13:55:44,278 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Using restart back off time strategy
> > NoRestartBackoffTimeStrategy for JobTest
> (99a030d0e3f428490a501c0132f27a56).
> > 2020-07-23 13:55:44,319 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Running initialization on master for job JobTest
> > (99a030d0e3f428490a501c0132f27a56).
> > 2020-07-23 13:55:44,319 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Successfully ran initialization on master in 0 ms.
> > 2020-07-23 13:55:44,428 INFO
> >  org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] -
> > Built 1 pipelined regions in 25 ms
> > 2020-07-23 13:55:44,437 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Loading state backend via factory
> > org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
> > 2020-07-23 13:55:44,456 INFO
> >  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> > predefined options: DEFAULT.
> > 2020-07-23 13:55:44,457 INFO
> >  org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> > default options factory:
> > DefaultConfigurableOptionsFactory{configuredOptions={}}.
> > 2020-07-23 13:55:44,466 WARN  org.apache.flink.runtime.util.HadoopUtils
> >                  [] - Could not find Hadoop configuration via any of the
> > supported methods (Flink configuration, environment variables).
> > 2020-07-23 13:55:45,276 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Using failover strategy
> >
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533
> > for JobTest (99a030d0e3f428490a501c0132f27a56).
> > 2020-07-23 13:55:45,280 INFO
> >  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] -
> > JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was
> > granted leadership with session id 00000000-0000-0000-0000-000000000000
> at
> > akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
> > 2020-07-23 13:55:45,286 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Starting scheduling with scheduling strategy
> > [org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]
> >
> >
> >
> > 2020-07-23 13:55:45,436 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
> > 2020-07-23 13:55:45,436 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
> > 2020-07-23 13:55:45,436 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
> > 2020-07-23 13:55:45,437 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{e559485ea7b0b7e17367816882538d90}]
> > 2020-07-23 13:55:45,437 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
> > 2020-07-23 13:55:45,437 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{582a86197884206652dff3aea2306bb3}]
> > 2020-07-23 13:55:45,437 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
> > 2020-07-23 13:55:45,437 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
> > 2020-07-23 13:55:45,438 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{c35033d598a517acc108424bb9f809fb}]
> > 2020-07-23 13:55:45,438 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
> > 2020-07-23 13:55:45,438 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> > serve slot request, no ResourceManager connected. Adding as pending
> request
> > [SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
> > 2020-07-23 13:55:45,487 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Connecting to ResourceManager
> > akka.tcp://flink@flink-jobmanager
> > :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> > 2020-07-23 13:55:45,492 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - Resolved ResourceManager address, beginning
> > registration
> > 2020-07-23 13:55:45,493 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registering job manager 00000000000000000000000000000000
> > @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> > 99a030d0e3f428490a501c0132f27a56.
> > 2020-07-23 13:55:45,499 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Registered job manager 00000000000000000000000000000000
> > @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> > 99a030d0e3f428490a501c0132f27a56.
> > 2020-07-23 13:55:45,501 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> >                 [] - JobManager successfully registered at
> ResourceManager,
> > leader id: 00000000000000000000000000000000.
> > 2020-07-23 13:55:45,501 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,502 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Request slot with profile ResourceProfile{UNKNOWN} for job
> > 99a030d0e3f428490a501c0132f27a56 with allocation id
> > d420d08bf2654d9ea76955c70db18b69.
> > 2020-07-23 13:55:45,502 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,503 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,503 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,503 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,503 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,503 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,503 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> >
> > 2020-07-23 13:55:45,514 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Request slot with profile ResourceProfile{UNKNOWN} for job
> > 99a030d0e3f428490a501c0132f27a56 with allocation id
> > fce526bbe3e1be91caa3e4b536b20e35.
> > 2020-07-23 13:55:45,514 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,514 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,515 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> >
> > 2020-07-23 13:55:45,517 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Request slot with profile ResourceProfile{UNKNOWN} for job
> > 99a030d0e3f428490a501c0132f27a56 with allocation id
> > 18ac7ec802ebfcfed8c05ee9324a55a4.
> >
> > 2020-07-23 13:55:45,518 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Request slot with profile ResourceProfile{UNKNOWN} for job
> > 99a030d0e3f428490a501c0132f27a56 with allocation id
> > 7ec76cbe689eb418b63599e90ade19be.
> > 2020-07-23 13:55:45,518 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,518 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,518 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,518 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,518 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,518 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,519 INFO
> >  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> > Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and
> > profile ResourceProfile{UNKNOWN} from resource manager.
> > 2020-07-23 13:55:45,519 INFO
> >  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> > Request slot with profile ResourceProfile{UNKNOWN} for job
> > 99a030d0e3f428490a501c0132f27a56 with allocation id
> > b78837a29b4032924ac25be70ed21a3c.
> >
> >
> > 2020-07-23 13:58:18,037 WARN
> >  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> > hostname could be resolved for the IP address 10.47.96.2, using IP
> address
> > as host name. Local input split assignment (such as for HDFS files) may
> be
> > impacted.
> > 2020-07-23 13:58:22,192 WARN
> >  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> > hostname could be resolved for the IP address 10.34.64.14, using IP
> address
> > as host name. Local input split assignment (such as for HDFS files) may
> be
> > impacted.
> > 2020-07-23 13:58:22,358 WARN
> >  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> > hostname could be resolved for the IP address 10.34.128.9, using IP
> address
> > as host name. Local input split assignment (such as for HDFS files) may
> be
> > impacted.
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.11 submit job timed out

Meng Wang
I ran into the same problem: the job could only be submitted successfully after I started the taskmanager-query-state-service.yaml service. In addition, I am testing on a locally installed k8s cluster, and if this were a GC problem, whether or not the TM service is started should make no difference.
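
A minimal sketch of such a TaskManager-side service is below. It is modeled on the taskmanager-query-state-service.yaml from the Flink Kubernetes session docs, with the type switched to ClusterIP as reported earlier in the thread; the port and labels are assumptions and must match your own TaskManager deployment. A plausible explanation for why it helps: a Service with a selector creates Endpoints for the TaskManager pod IPs, and CoreDNS serves PTR records for endpoint IPs, so the JobManager's reverse lookups stop failing.

apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager-query-state
spec:
  type: ClusterIP            # the thread reports changing the docs' NodePort to ClusterIP as part of the fix
  ports:
  - name: query-state
    port: 6125               # queryable-state proxy port from the Flink docs example; adjust if changed
    targetPort: 6125
  selector:
    app: flink               # assumed labels; must match the TaskManager pods
    component: taskmanager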


--

Best,
Matt Wang


On 07/27/2020 15:01, Yang Wang <[hidden email]> wrote:
I suggest first setting heartbeat.timeout to a larger value and printing the GC log,
then checking whether full GC happens frequently and how long each pause lasts. From the log you have provided so far, even the in-process JM->RM heartbeat times out,
so I still suspect this is GC-related.

env.java.opts.jobmanager: -Xloggc:<LOG_DIR>/jobmanager-gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=512M
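
For reference, a minimal flink-conf.yaml sketch combining the two suggestions; the 300000 ms timeout is only an illustrative assumption (the default heartbeat.timeout is 50000 ms), not a value recommended in this thread:

heartbeat.timeout: 300000
env.java.opts.jobmanager: -Xloggc:<LOG_DIR>/jobmanager-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=512M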


Best,
Yang

SmileSmile <[hidden email]> wrote on Mon, Jul 27, 2020 at 1:50 PM:

Hi, Yang Wang

Because the log was too long, I removed some of the repeated content.
I suspected JM GC at first as well, but setting the JM memory to 10g made no difference.

Best




On 07/27/2020 11:36, Yang Wang wrote:
Looking at this job, the root cause of the failure is not "No hostname could be resolved"; the reason for that WARNING can be discussed separately (if it does not occur in 1.10).
If you start a standalone cluster locally you will also see this WARNING, and it does not affect normal use.


The failure is that the slot request timed out after 5 minutes. In the log you gave, the span from 2020-07-23 13:55:45,519 to 2020-07-23
13:58:18,037 is blank; you did not omit anything there, did you?
By rights the tasks should have started deploying during that period. The log also shows the JM->RM heartbeat timing out, meaning even communication within the same process in the same pod timed out,
so I suspect the JM has been in full GC the whole time; please confirm this.
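
For context, the 5-minute window and the heartbeats mentioned above correspond to two configuration keys; a minimal flink-conf.yaml sketch with what should be the Flink 1.11 defaults (values assumed from the documentation, not read from this cluster):

# slot allocation gives up after this many milliseconds (5 minutes by default)
slot.request.timeout: 300000
# a JM/RM/TM peer is considered dead after this many milliseconds without a heartbeat
heartbeat.timeout: 50000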


Best,
Yang

SmileSmile <[hidden email]> wrote on Thu, Jul 23, 2020 at 2:43 PM:

Hi Yang Wang

First, let me share the versions in my environment:


kubernetes: 1.17.4.   CNI: weave


Points 1, 2 and 3 below are my questions;

point 4 is the JM log.


1. After removing taskmanager-query-state-service.yaml, reverse resolution indeed does not work; nslookup:

kubectl exec -it busybox2 -- /bin/sh
/ # nslookup 10.47.96.2
Server:          10.96.0.10
Address:     10.96.0.10:53

** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN



2. Flink 1.11 vs. Flink 1.10

On the "detail subtasks taskmanagers xxx x" line,

1.11 now shows 172-20-0-50, while 1.10 showed flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed here? (This cluster is currently running both 1.10 and 1.11, and 1.10 runs fine; if coredns had a problem, the 1.10 Flink should show the same behavior, right?)

3. Does coredns need any special configuration?

Resolving domain names inside the containers works fine; only reverse resolution fails when there is no service. Is there anything that needs to be configured in coredns?
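
For what it's worth, a stock CoreDNS Corefile already serves the reverse zones, but the kubernetes plugin only answers PTR queries for Service IPs and for pod IPs that back a Service's Endpoints, so a bare TaskManager pod IP comes back NXDOMAIN exactly as above; this is an assumption based on the CoreDNS documentation, not something verified on this cluster. A typical default block looks like:

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    forward . /etc/resolv.conf
    cache 30
}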


4. The JM log at the time of the timeout is as follows:



2020-07-23 13:53:00,228 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
ResourceManager akka.tcp://flink@flink-jobmanager
:6123/user/rpc/resourcemanager_0
was granted leadership with fencing token
00000000000000000000000000000000
2020-07-23 13:53:00,232 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
Starting
RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
at akka://flink/user/rpc/dispatcher_1 .
2020-07-23 13:53:00,233 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl []
-
Starting the SlotManager.
2020-07-23 13:53:03,472 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696
(akka.tcp://flink@10.34.128.9:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:03,777 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4
(akka.tcp://flink@10.32.160.6:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:03,787 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4
(akka.tcp://flink@10.39.0.8:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:04,044 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2
(akka.tcp://flink@10.38.64.7:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:04,099 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330
(akka.tcp://flink@10.42.160.6:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:04,146 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8
(akka.tcp://flink@10.33.192.14:6122/user/rpc/taskmanager_0) at
ResourceManager


2020-07-23 13:55:44,220 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
Received
JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,222 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,251 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
Starting
RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at
akka://flink/user/rpc/jobmanager_2 .
2020-07-23 13:55:44,260 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Initializing job JobTest
(99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,278 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Using restart back off time strategy
NoRestartBackoffTimeStrategy for JobTest
(99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Running initialization on master for job JobTest
(99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Successfully ran initialization on master in 0 ms.
2020-07-23 13:55:44,428 INFO
org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] -
Built 1 pipelined regions in 25 ms
2020-07-23 13:55:44,437 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Loading state backend via factory
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
2020-07-23 13:55:44,456 INFO
org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
predefined options: DEFAULT.
2020-07-23 13:55:44,457 INFO
org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
default options factory:
DefaultConfigurableOptionsFactory{configuredOptions={}}.
2020-07-23 13:55:44,466 WARN  org.apache.flink.runtime.util.HadoopUtils
[] - Could not find Hadoop configuration via any of the
supported methods (Flink configuration, environment variables).
2020-07-23 13:55:45,276 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Using failover strategy

org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533
for JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:45,280 INFO
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] -
JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was
granted leadership with session id 00000000-0000-0000-0000-000000000000
at
akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
2020-07-23 13:55:45,286 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Starting scheduling with scheduling strategy
[org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]



2020-07-23 13:55:45,436 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
2020-07-23 13:55:45,436 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
2020-07-23 13:55:45,436 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{e559485ea7b0b7e17367816882538d90}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{582a86197884206652dff3aea2306bb3}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
2020-07-23 13:55:45,438 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{c35033d598a517acc108424bb9f809fb}]
2020-07-23 13:55:45,438 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
2020-07-23 13:55:45,438 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
2020-07-23 13:55:45,487 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Connecting to ResourceManager
akka.tcp://flink@flink-jobmanager
:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 13:55:45,492 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Resolved ResourceManager address, beginning
registration
2020-07-23 13:55:45,493 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering job manager 00000000000000000000000000000000
@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,499 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registered job manager 00000000000000000000000000000000
@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,501 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - JobManager successfully registered at
ResourceManager,
leader id: 00000000000000000000000000000000.
2020-07-23 13:55:45,501 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,502 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
d420d08bf2654d9ea76955c70db18b69.
2020-07-23 13:55:45,502 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and
profile ResourceProfile{UNKNOWN} from resource manager.

2020-07-23 13:55:45,514 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
fce526bbe3e1be91caa3e4b536b20e35.
2020-07-23 13:55:45,514 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,514 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,515 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and
profile ResourceProfile{UNKNOWN} from resource manager.

2020-07-23 13:55:45,517 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
18ac7ec802ebfcfed8c05ee9324a55a4.

2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
7ec76cbe689eb418b63599e90ade19be.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
b78837a29b4032924ac25be70ed21a3c.


2020-07-23 13:58:18,037 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.47.96.2, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:22,192 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.34.64.14, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:22,358 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.34.128.9, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:24,562 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.32.160.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:25,487 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.38.64.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:27,636 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.42.160.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:27,767 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.43.64.12, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:29,651 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed out.
2020-07-23 13:58:29,651 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.
2020-07-23 13:58:29,854 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.39.0.8, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:33,623 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.35.0.10, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:35,756 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.36.32.8, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-23 13:58:36,694 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.42.128.6, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.


2020-07-23 14:01:17,814 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b timed out..
2020-07-23 14:01:17,815 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Connecting to ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 14:01:17,816 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Resolved ResourceManager address, beginning registration
2020-07-23 14:01:17,816 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:17,836 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: host_relation -> Timestamps/Watermarks -> Map (1/1) (302ca9640e2d209a543d843f2996ccd2) switched from SCHEDULED to FAILED on not deployed.

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate the required slot within slot request timeout. Please
make sure that the cluster has enough resources.
at

org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: java.util.concurrent.CompletionException:
java.util.concurrent.TimeoutException
at

java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
~[?:1.8.0_242]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more
2020-07-23 14:01:17,848 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,910 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 902 tasks should be restarted to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,913 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state RUNNING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
at

org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate the required slot within slot request timeout. Please
make sure that the cluster has enough resources.
at

org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
... 45 more
Caused by: java.util.concurrent.CompletionException:
java.util.concurrent.TimeoutException
at

java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
~[?:1.8.0_242]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more



2020-07-23 14:01:18,109 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 1809eb912d69854f2babedeaf879df6a.
2020-07-23 14:01:18,110 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
at

org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate the required slot within slot request timeout. Please
make sure that the cluster has enough resources.
at

org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
... 45 more
Caused by: java.util.concurrent.CompletionException:
java.util.concurrent.TimeoutException
at

java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
~[?:1.8.0_242]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more
2020-07-23 14:01:18,114 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,117 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2020-07-23 14:01:18,118 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 302ca9640e2d209a543d843f2996ccd2.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] timed out.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] timed out.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{e7e422409acebdb385014a9634af6a90}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] timed out.
2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] timed out.
2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [


2020-07-23 14:01:18,151 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 99a030d0e3f428490a501c0132f27a56 reached globally terminal state FAILED.
2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-07-23 14:01:18,225 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job JobTest(99a030d0e3f428490a501c0132f27a56).
2020-07-23 14:01:18,381 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Suspending SlotPool.
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: JobManager is shutting down..
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Stopping SlotPool.
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.


On 07/23/2020 13:26, Yang Wang <[hidden email]> wrote:
Glad your problem is solved, but I don't think the root cause is adding taskmanager-query-state-service.yaml.
On my side everything works without creating that service, and nslookup {tm_ip_address} reverse-resolves to a hostname normally.

Note that this is not about resolving a hostname, but about verifying by reverse-resolving the IP address.


To answer your two questions:
1. It is not required. I verified that the cluster runs jobs normally without creating it, and exposing the REST service as ClusterIP, NodePort, or LoadBalancer all works fine.
2. If taskmanager.bind-host is not configured, the two JIRAs [Flink-15911][Flink-15154] do not affect the address the TM uses when registering with the RM.

If you want to find the root cause, you will probably need to provide the complete JM/TM logs so that this can be analyzed.
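
A minimal sketch of collecting those with kubectl (the deployment name follows the flink-jobmanager address seen in your logs; the component=taskmanager label is an assumption, so adjust both to your own manifests):

# Dump the full JobManager log (assumes the JM runs as a Deployment named flink-jobmanager)
kubectl logs deployment/flink-jobmanager > jobmanager.log

# Dump the log of every TaskManager pod (assumes the TM pods carry the label component=taskmanager)
for pod in $(kubectl get pods -l component=taskmanager -o name); do
  kubectl logs "$pod" > "$(basename "$pod").log"
done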


Best,
Yang

SmileSmile <[hidden email]> wrote on Thursday, July 23, 2020 at 11:30 AM:


Hi Yang Wang

I just tested this in the test environment: the TaskManager IPs cannot be resolved with nslookup, while the JM IP can. The difference between the two is whether a Service exists.

Workaround: I added taskmanager-query-state-service.yaml to the cluster (an optional service according to the official docs). After that, the "No hostname could be resolved for IP address" warnings no longer flood the log; with NodePort changed to ClusterIP, the job can be submitted successfully and the timeout no longer occurs, so the problem is resolved (a minimal sketch of such a Service follows after the docs link below).


1. Given the above, is this manifest actually required?

2. Among the 1.11 changes I see [Flink-15911][Flink-15154], which add support for configuring the locally bound network interface and the externally accessible address/port separately. Is it this change that requires the JM to reverse-resolve a Service from the IP reported by the TM?


Best!


[1]


https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
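
For reference, a minimal sketch of such a Service selecting the TaskManager pods; the name, labels, and the 6125 query-state port are modeled on the optional taskmanager-query-state-service.yaml from the docs above and should be treated as assumptions, matched to your own TaskManager labels:

# Minimal ClusterIP Service over the TaskManager pods (names/labels/port are assumptions)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager-query-state
spec:
  type: ClusterIP
  ports:
  - name: query-state
    port: 6125
    targetPort: 6125
  selector:
    app: flink
    component: taskmanager
EOF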


On 07/23/2020 10:11, Yang Wang <[hidden email]> wrote:
What I mean is: while the Flink job is running, use the command below to start a busybox pod inside the cluster, then run nslookup {ip_address} in it to see whether the address resolves normally. If it does not, it should be a coredns problem.

kubectl run -i -t busybox --image=busybox --restart=Never

You also need to confirm that the cluster's coredns pods are healthy; they are usually deployed in the kube-system namespace.
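
A minimal sketch of that check (the 10.47.96.2 address is just one of the IPs from the warnings above):

# Confirm the coredns pods in kube-system are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Start a throwaway busybox pod and try to reverse-resolve one of the TaskManager pod IPs
kubectl run -i -t busybox --image=busybox --restart=Never -- nslookup 10.47.96.2

# Remove the pod afterwards
kubectl delete pod busybox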



Best,
Yang


SmileSmile <[hidden email]> wrote on Wednesday, July 22, 2020 at 7:57 PM:


Hi, Yang Wang!

Very glad to get your reply; it helped a lot and pointed me toward the problem. Let me add some more information in the hope that it helps pin down the root cause.

Where the JM reports "No hostname could be resolved for ip address xxxxx", the reported IP is the internal IP that k8s assigned to the Flink pod, not the host machine's IP. Where could this problem come from?

Best!



On 07/22/2020 18:18, Yang Wang <[hidden email]> wrote:
If your log keeps printing "No hostname could be resolved for the IP address", the cluster's coredns probably has a problem: the reverse lookup from IP address to hostname fails. You can start a busybox pod to verify whether that IP really cannot be resolved; it may well be a coredns issue.


Best,
Yang

Congxian Qiu <[hidden email]> wrote on Tuesday, July 21, 2020 at 7:29 PM:

Hi,
I'm not sure whether the complete pod logs are visible in the k8s environment, similar to the Yarn NM logs. If they are, try going through the full log of this pod and see whether anything stands out.
Best,
Congxian


SmileSmile <[hidden email]> wrote on Tuesday, July 21, 2020 at 3:19 PM:

Hi, Congxian

Since this is a test environment, HA is not configured. What I see so far is that the JM prints a large number of "no hostname could be resolved" warnings, the JM loses contact, and the job submission fails. Setting the JM memory to 10g makes no difference (jobmanager.memory.process.size: 10240m).

Rolling back to 1.10 in the same environment, the problem does not occur and the warnings above are not printed.


Are there any other troubleshooting ideas?

Best!





On 07/16/2020 13:17, Congxian Qiu wrote:
Hi
If there are no exceptions and GC looks normal, perhaps take a look at the pod's logs; if HA is enabled you can also check the zk logs. I once saw a similar symptom in a Yarn environment that was caused by something else, and the cause was found by reading the NM logs and the zk logs.

Best,
Congxian


SmileSmile <[hidden email]> wrote on Wednesday, July 15, 2020 at 5:20 PM:

Hi Roc

This symptom does not occur in 1.10.1; it only appears in 1.11. What would be a good way to investigate it?




Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.11 submit job timed out

DONG, Weike
Hi,

We ran into the same problem: as the parallelism goes up, the JobManager gets stuck for longer and longer, until all the TaskManagers are forced to time out. So far it does not look GC-related; the network is the bigger suspect.
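
While that is being investigated, one stopgap along the lines suggested below is to raise the heartbeat timeout and turn on JobManager GC logging; a minimal sketch (the values and the log path are assumptions, not recommendations; heartbeat.timeout defaults to 50000 ms):

# Append to conf/flink-conf.yaml before redeploying the session cluster (example values only)
cat >> conf/flink-conf.yaml <<'EOF'
heartbeat.timeout: 300000
env.java.opts.jobmanager: -Xloggc:/opt/flink/log/jobmanager-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=512M
EOF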

On Fri, Jul 31, 2020 at 7:55 PM Matt Wang <[hidden email]> wrote:

> Hit the same problem here: jobs could only be submitted normally after the taskmanager-query-state-service.yaml service was started. I was also testing on a locally installed k8s cluster. If it were a GC problem, whether or not the TM service is started should make no difference.
>
>
> --
>
> Best,
> Matt Wang
>
>
> On 07/27/2020 15:01, Yang Wang <[hidden email]> wrote:
> I suggest first setting heartbeat.timeout to a larger value, and then printing the GC log
> to see whether full GC happens often and how long each one lasts. From the log you have provided so far, even the in-process JM->RM heartbeat times out,
> so I still suspect it is GC-related.
>
> env.java.opts.jobmanager: -Xloggc:<LOG_DIR>/jobmanager-gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=512M
>
>
> Best,
> Yang
>
> SmileSmile <[hidden email]> wrote on Monday, July 27, 2020 at 1:50 PM:
>
> Hi, Yang Wang
>
> Because the log was too long, I removed some repeated content.
> At first I also suspected a JM GC problem, but setting the JM memory to 10g made no difference.
>
> Best
>
>
>
>
> On 07/27/2020 11:36, Yang Wang wrote:
> Looking at this job, the root cause of the failure is not "No hostname could be resolved"; the reason for that WARNING can be discussed separately (if it does not occur in 1.10).
> You can start a standalone cluster locally and you will see the same WARNING; it does not affect normal use.
>
>
> The failure is that the slot request timed out after 5 minutes. In the log you provided, the span from 2020-07-23 13:55:45,519 to 2020-07-23 13:58:18,037 is blank; nothing was omitted there, right?
> That is exactly when the tasks should have started deploying. The log also shows the JM->RM heartbeat timing out, i.e. communication within the same process in the same pod also timed out,
> so I suspect the JM was doing full GC the whole time; please confirm this.
>
>
> Best,
> Yang
>
> SmileSmile <[hidden email]> wrote on Thursday, July 23, 2020 at 2:43 PM:
>
> Hi Yang Wang
>
> First, the versions in my environment:
>
> kubernetes: 1.17.4.   CNI: weave
>
> Items 1-3 below are my questions; item 4 is the JM log.
>
> 1. After removing taskmanager-query-state-service.yaml, nslookup indeed fails:
>
> kubectl exec -it busybox2 -- /bin/sh
> / # nslookup 10.47.96.2
> Server:          10.96.0.10
> Address:     10.96.0.10:53
>
> ** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN
>
>
> 2. Flink 1.11 vs Flink 1.10:
>
> on the "detail subtasks taskmanagers xxx x" row of the web UI,
>
> 1.11 now shows 172-20-0-50, while 1.10 shows flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed here? (This cluster currently runs both 1.10 and 1.11; 1.10 runs fine, and if coredns were broken, Flink 1.10 should show the same symptom, right?)
>
> 3. Does coredns need any special configuration?
>
> Resolving domain names inside the containers works fine; only reverse resolution fails when there is no Service. Is there anything that needs to be configured in coredns?
>
>
> 4. The JM log at the time of the timeout is as follows:
>
>
>
> 2020-07-23 13:53:00,228 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> ResourceManager akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_0
> was granted leadership with fencing token
> 00000000000000000000000000000000
> 2020-07-23 13:53:00,232 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
> Starting
> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> at akka://flink/user/rpc/dispatcher_1 .
> 2020-07-23 13:53:00,233 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl []
> -
> Starting the SlotManager.
> 2020-07-23 13:53:03,472 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696
> (akka.tcp://flink@10.34.128.9:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:03,777 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4
> (akka.tcp://flink@10.32.160.6:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:03,787 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4
> (akka.tcp://flink@10.39.0.8:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,044 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2
> (akka.tcp://flink@10.38.64.7:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,099 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330
> (akka.tcp://flink@10.42.160.6:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2020-07-23 13:53:04,146 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8
> (akka.tcp://flink@10.33.192.14:6122/user/rpc/taskmanager_0) at
> ResourceManager
>
>
> 2020-07-23 13:55:44,220 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
> Received
> JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
> 2020-07-23 13:55:44,222 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
> Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
> 2020-07-23 13:55:44,251 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
> Starting
> RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at
> akka://flink/user/rpc/jobmanager_2 .
> 2020-07-23 13:55:44,260 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Initializing job JobTest
> (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,278 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Using restart back off time strategy
> NoRestartBackoffTimeStrategy for JobTest
> (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,319 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Running initialization on master for job JobTest
> (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:44,319 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Successfully ran initialization on master in 0 ms.
> 2020-07-23 13:55:44,428 INFO
> org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] -
> Built 1 pipelined regions in 25 ms
> 2020-07-23 13:55:44,437 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Loading state backend via factory
> org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
> 2020-07-23 13:55:44,456 INFO
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> predefined options: DEFAULT.
> 2020-07-23 13:55:44,457 INFO
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
> default options factory:
> DefaultConfigurableOptionsFactory{configuredOptions={}}.
> 2020-07-23 13:55:44,466 WARN  org.apache.flink.runtime.util.HadoopUtils
> [] - Could not find Hadoop configuration via any of the
> supported methods (Flink configuration, environment variables).
> 2020-07-23 13:55:45,276 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Using failover strategy
>
>
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533
> for JobTest (99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 13:55:45,280 INFO
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] -
> JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was
> granted leadership with session id 00000000-0000-0000-0000-000000000000
> at
> akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
> 2020-07-23 13:55:45,286 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Starting scheduling with scheduling strategy
> [org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]
>
>
>
> 2020-07-23 13:55:45,436 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
> 2020-07-23 13:55:45,436 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
> 2020-07-23 13:55:45,436 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
> 2020-07-23 13:55:45,437 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{e559485ea7b0b7e17367816882538d90}]
> 2020-07-23 13:55:45,437 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
> 2020-07-23 13:55:45,437 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{582a86197884206652dff3aea2306bb3}]
> 2020-07-23 13:55:45,437 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
> 2020-07-23 13:55:45,437 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
> 2020-07-23 13:55:45,438 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{c35033d598a517acc108424bb9f809fb}]
> 2020-07-23 13:55:45,438 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
> 2020-07-23 13:55:45,438 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
> serve slot request, no ResourceManager connected. Adding as pending
> request
> [SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
> 2020-07-23 13:55:45,487 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Connecting to ResourceManager
> akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> 2020-07-23 13:55:45,492 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Resolved ResourceManager address, beginning
> registration
> 2020-07-23 13:55:45,493 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 13:55:45,499 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 13:55:45,501 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - JobManager successfully registered at
> ResourceManager,
> leader id: 00000000000000000000000000000000.
> 2020-07-23 13:55:45,501 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,502 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> d420d08bf2654d9ea76955c70db18b69.
> 2020-07-23 13:55:45,502 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,503 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
>
> 2020-07-23 13:55:45,514 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> fce526bbe3e1be91caa3e4b536b20e35.
> 2020-07-23 13:55:45,514 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,514 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,515 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
>
> 2020-07-23 13:55:45,517 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> 18ac7ec802ebfcfed8c05ee9324a55a4.
>
> 2020-07-23 13:55:45,518 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> 7ec76cbe689eb418b63599e90ade19be.
> 2020-07-23 13:55:45,518 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,518 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,519 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
> Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and
> profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-07-23 13:55:45,519 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 99a030d0e3f428490a501c0132f27a56 with allocation id
> b78837a29b4032924ac25be70ed21a3c.
>
>
> 2020-07-23 13:58:18,037 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.47.96.2, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:22,192 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.34.64.14, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:22,358 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.34.128.9, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:24,562 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.32.160.6, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:25,487 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.38.64.7, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:27,636 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.42.160.6, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:27,767 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.43.64.12, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:29,651 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b
> timed
> out.
> 2020-07-23 13:58:29,651 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Disconnect job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56 from the resource manager.
> 2020-07-23 13:58:29,854 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.39.0.8, using IP address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:33,623 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.35.0.10, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:35,756 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.36.32.8, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
> 2020-07-23 13:58:36,694 WARN
> org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
> hostname could be resolved for the IP address 10.42.128.6, using IP
> address
> as host name. Local input split assignment (such as for HDFS files) may
> be
> impacted.
>
>
> 2020-07-23 14:01:17,814 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Close ResourceManager connection
> 83b1ff14900abfd54418e7fa3efb3f8a: The heartbeat of JobManager with id
> 456a18b6c404cb11a359718e16de1c6b timed out..
> 2020-07-23 14:01:17,815 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Connecting to ResourceManager
> akka.tcp://flink@flink-jobmanager
> :6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
> 2020-07-23 14:01:17,816 INFO
> org.apache.flink.runtime.jobmaster.JobMaster
> [] - Resolved ResourceManager address, beginning
> registration
> 2020-07-23 14:01:17,816 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager 00000000000000000000000000000000
> @akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
> 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:17,836 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
> Source:
> host_relation -> Timestamps/Watermarks -> Map (1/1)
> (302ca9640e2d209a543d843f2996ccd2) switched from SCHEDULED to FAILED on
> not
> deployed.
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
>     ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>     ... 23 more
> 2020-07-23 14:01:17,848 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
> 2020-07-23 14:01:17,910 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 902 tasks should be restarted to recover the failed task cbc357ccb763df2852fee8c4fc7d55f2_0.
> 2020-07-23 14:01:17,913 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state RUNNING to FAILING.
> org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
>     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     ... 45 more
> Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
>     ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>     ... 23 more
>
>
>
> 2020-07-23 14:01:18,109 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 1809eb912d69854f2babedeaf879df6a.
> 2020-07-23 14:01:18,110 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state FAILING to FAILED.
> org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
>     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
>     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     ... 45 more
> Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
>     ... 25 more
> Caused by: java.util.concurrent.TimeoutException
>     ... 23 more
> 2020-07-23 14:01:18,114 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,117 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
> 2020-07-23 14:01:18,118 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 302ca9640e2d209a543d843f2996ccd2.
> 2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] timed out.
> 2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] timed out.
> 2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{e7e422409acebdb385014a9634af6a90}] timed out.
> 2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] timed out.
> 2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] timed out.
> 2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] timed out.
> 2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] timed out.
> 2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [
>
> 2020-07-23 14:01:18,151 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 99a030d0e3f428490a501c0132f27a56 reached globally terminal state FAILED.
> 2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
> 2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
> 2020-07-23 14:01:18,225 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job JobTest(99a030d0e3f428490a501c0132f27a56).
> 2020-07-23 14:01:18,381 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Suspending SlotPool.
> 2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: JobManager is shutting down..
> 2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Stopping SlotPool.
> 2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.
>
>
> On 07/23/2020 13:26, Yang Wang <[hidden email]> wrote:
> Glad to hear your problem is solved, but I don't think the root cause is adding taskmanager-query-state-service.yaml.
> On my side the cluster works fine without creating that service, and nslookup {tm_ip_address} reverse-resolves to a hostname correctly.
>
> Note that this is not about resolving a hostname, but about verifying by reverse-resolving the IP address.
>
> To answer your two questions:
> 1. It is not required. I verified that the cluster runs jobs normally without creating it, and exposing the REST service as ClusterIP, NodePort or LoadBalancer all work fine.
> 2. If taskmanager.bind-host is not configured, the two JIRAs [FLINK-15911][FLINK-15154] do not affect the address the TM uses when registering with the RM.
>
> If you want to find the root cause, you probably need to provide the complete JM/TM logs so they can be analyzed.
>
>
> Best,
> Yang
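For readers following the bind-host point above: the options that FLINK-15911/FLINK-15154 introduced live in flink-conf.yaml. A minimal sketch of what is being referred to, where the concrete values are illustrative assumptions rather than recommended settings:

# Sketch only: 1.11 lets the locally bound interface and the advertised address
# be configured separately. The values below are assumptions for illustration.
cat >> conf/flink-conf.yaml <<'EOF'
taskmanager.bind-host: 0.0.0.0    # interface the TM binds to locally
taskmanager.host: 10.42.160.6     # address the TM reports to the RM/JM
EOF

As noted above, if taskmanager.bind-host is left unset, TM registration behaves as it did before 1.11.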
>
> On Thu, Jul 23, 2020 at 11:30 AM SmileSmile <[hidden email]> wrote:
>
>
> Hi Yang Wang
>
> I just tested this in the test environment: the TaskManager IPs cannot be nslookup'ed, while the JM's can. The difference between the two is whether a Service exists for them.
>
> Solution: I added taskmanager-query-state-service.yaml to the cluster (an optional service according to the official docs [1]) and changed NodePort to ClusterIP. The "No hostname could be resolved for the IP address" messages are no longer printed, the job is submitted successfully, and the timeout no longer occurs, so the problem is solved.
>
> 1. Given the above, is this manifest actually mandatory?
>
> 2. Among the 1.11 changes I noticed [FLINK-15911][FLINK-15154], which add support for configuring the locally bound network interface separately from the externally accessible address and port. Is it this change that makes the JM reverse-resolve a service from the IP reported by the TM?
>
> Best!
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
>
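For anyone hitting the same symptom, here is a minimal sketch of the kind of TaskManager Service described above; the name, labels and port are assumptions based on the standard session-cluster example, not the exact manifest from the docs:

# Sketch only: a ClusterIP Service selecting the TaskManager pods, which is what
# made the TM pod IPs resolvable in this report. Name/labels/port are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: flink-taskmanager-query-state
spec:
  type: ClusterIP
  ports:
  - name: query-state
    port: 6125
    targetPort: 6125
  selector:
    app: flink
    component: taskmanager
EOF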
>
> On 07/23/2020 10:11, Yang Wang <[hidden email]> wrote:
> What I mean is: while the Flink job is running, use the command below to start a busybox pod inside the cluster, then run nslookup {ip_address} in it and see whether the address resolves. If it does not, the problem should be coredns.
>
> kubectl run -i -t busybox --image=busybox --restart=Never
>
> You also need to confirm that the cluster's coredns pods are healthy; they are normally deployed in the kube-system namespace.
>
>
>
> Best,
> Yang
>
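A concrete way to run this check; the IP below is just one of the TM pod IPs from the log above, and the coredns label is the usual k8s-app=kube-dns, which may differ on your distribution:

# Start a throwaway busybox pod and reverse-resolve a TM pod IP from inside it.
kubectl run -i -t busybox --image=busybox --restart=Never -- sh
#   nslookup 10.42.160.6    # inside the pod: should return a name; a failure points at coredns
#   exit

# Confirm the coredns pods are healthy and peek at their recent logs.
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50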
>
> On Wed, Jul 22, 2020 at 7:57 PM SmileSmile <[hidden email]> wrote:
>
>
> Hi, Yang Wang!
>
> I'm glad to get your reply; it helps a lot and points me in the right direction. Let me add some information that may help pin down the root cause further.
>
> Where the JM reports "No hostname could be resolved for ip address xxxxx", the reported IP is the internal IP that k8s assigned to the Flink pod, not the host machine's IP. Where could this problem come from?
>
> Best!
>
>
>
> On 07/22/2020 18:18, Yang Wang <[hidden email]> wrote:
> If your log keeps printing "No hostname could be resolved for the IP address", the cluster's coredns probably has a problem: the reverse lookup from IP address to hostname fails. You can start a busybox pod to verify whether that IP really cannot be resolved; it may well be a coredns issue.
>
>
> Best,
> Yang
>
> On Tue, Jul 21, 2020 at 7:29 PM Congxian Qiu <[hidden email]> wrote:
>
> Hi
> I'm not sure whether the complete pod logs are visible in the k8s environment, similar to the NM logs on Yarn. If they are, try looking through this pod's full log to see whether anything stands out.
> Best,
> Congxian
>
>
> On Tue, Jul 21, 2020 at 3:19 PM SmileSmile <[hidden email]> wrote:
>
> Hi, Congxian
>
> This is a test environment, so HA is not configured. What I see so far is that the JM prints a large number of "no hostname could be resolved" messages, the JM loses contact, and job submission fails.
> Setting the JM memory to 10g makes no difference (jobmanager.memory.process.size: 10240m).
>
> Rolling back to 1.10 in the same environment, the problem does not occur and the errors above are not printed.
>
> Are there any other ideas for troubleshooting?
>
> Best!
>
>
>
>
>
> On 07/16/2020 13:17, Congxian Qiu wrote:
> Hi
> If there are no exceptions and GC also looks normal, perhaps take a look at the pod logs, and if HA is enabled also check the zk logs. I once ran into a similar symptom on Yarn that was caused by something else, and the cause was found through the NM and zk logs.
> 环境中类似的现象是由于其他原因导致的,通过看 NM 日志以及 zk 日志发现的原因。
>
> Best,
> Congxian
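On Kubernetes the per-pod logs mentioned here can be pulled with kubectl; a short sketch, assuming the deployment and label names of the standard session-cluster manifests:

# Assumed resource names; adjust to your own manifests.
kubectl logs deploy/flink-jobmanager --tail=500       # JobManager log
kubectl logs -l component=taskmanager --tail=200      # TaskManager logs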
>
>
> On Wed, Jul 15, 2020 at 5:20 PM SmileSmile <[hidden email]> wrote:
>
> Hi Roc
>
> This symptom does not occur on 1.10.1; it only appears on 1.11. What would be an appropriate way to investigate it?
>
>
>
>