Flink on Kubernetes: jobs occasionally getting stuck

Flink on Kubernetes: jobs occasionally getting stuck

Chris Guo
hi, all


Cluster information:
Flink 1.10, deployed on Kubernetes; Kubernetes version 1.17.4, Docker version 19.03, and the CNI plugin is Weave.


Symptom:
While the job is running, an operator occasionally gets stuck: downstream tasks stop receiving data, the watermark no longer advances, backpressure propagates upstream, and after a while the job gets killed.


A fragment of the stack trace captured with jstack:


"Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f waiting on condition [0x00007f66b04ed000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000608f3c600> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)
        at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209)
        at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189)
        at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:116)
        at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60)
        at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
        at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
        at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
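
In case it helps anyone reproduce this, a thread dump like the one above can be captured from inside the TaskManager pod, for example (the pod name and JVM PID are placeholders; the PID can be located with jps):

kubectl exec -it <taskmanager-pod> -- jps                                        # find the TaskManager JVM pid
kubectl exec -it <taskmanager-pod> -- jstack <tm-jvm-pid> > taskmanager.jstack   # dump all thread stacks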




We suspected a problem with the virtualized (overlay) network and added the following parameters, with no effect:
taskmanager.network.request-backoff.max: 300000
akka.ask.timeout: 120s
akka.watch.heartbeat.interval: 10s


We also tried adjusting the number of network buffers, again with no effect:
taskmanager.network.memory.floating-buffers-per-gate: 16
taskmanager.network.memory.buffers-per-channel: 6
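
For scale: assuming the default 32 KB memory segment size and a fully connected shuffle at parallelism 200, these settings work out to roughly
200 channels * 6 buffers-per-channel + 16 floating buffers = 1216 buffers for a single input gate,
1216 buffers * 32 KB ≈ 38 MB,
and every shuffle edge hosted by a TaskManager adds a comparable amount of demand on its network buffer pool.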




That is the scenario I am running into; any replies or explanations would be greatly appreciated.

Looking forward to your reply and help.

Best




Re: Flink on Kubernetes: jobs occasionally getting stuck

LakeShen
Hi,

You could check your memory configuration and see whether it is set too small, leaving not enough network buffers.

See the documentation for details:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html
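
A minimal flink-conf.yaml sketch of the Flink 1.10 network-memory options (the values below are illustrative placeholders, not recommendations):

# Flink 1.10 memory model: network memory is derived from total Flink memory
taskmanager.memory.process.size: 4096m
# network buffers get this fraction of total Flink memory, clamped to [min, max]
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.network.min: 256mb
taskmanager.memory.network.max: 1gb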

Best,
LakeShen


Re: Flink on Kubernetes: jobs occasionally getting stuck

Chris Guo

Our network memory is configured with taskmanager.network.memory.min: 2g and taskmanager.network.memory.max: 3g.

If network memory were insufficient, shouldn't the cluster fail with an error at startup? Can it instead only surface as a hang while the job is running?
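
(For scale, assuming the default 32 KB memory segment size: 2g of network memory is about 65,536 buffers and 3g about 98,304; whether that is sufficient depends on how many input gates and result subpartitions each TaskManager ends up hosting.)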



a511955993
Email: [hidden email]


Re: Flink on Kubernetes: jobs occasionally getting stuck

shao.hongxiao
In reply to this post by LakeShen
Hi, has this problem been resolved? If so, how did you solve it? Thanks.


邵红晓
Email: [hidden email]