hi, all
集群信息: flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, cni使用的是weave。 现象: 作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。 通过jstack出来的堆栈信息片段如下: "Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f waiting on condition [0x00007f66b04ed000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000608f3c600> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209) at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189) at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:116) at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45) 有怀疑过是虚拟化网络问题,增加了如下参数,不见效: taskmanager.network.request-backoff.max: 300000 akka.ask.timeout: 120s akka.watch.heartbeat.interval: 10s 尝试过调整buffer数量,不见效: taskmanager.network.memory.floating-buffers-per-gate: 16 taskmanager.network.memory.buffers-per-channel: 6 目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。 Looking forward to your reply and help. Best |
Hi ,
你可以看下你的内存配置情况,看看是不是内存配置太小,导致 networkd bufffers 不够。 具体文档参考: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html Best, LakeShen a511955993 <[hidden email]> 于2020年5月7日周四 下午9:54写道: > hi, all > > > 集群信息: > flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, > cni使用的是weave。 > > > 现象: > 作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。 > > > 通过jstack出来的堆栈信息片段如下: > > > "Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f > waiting on condition [0x00007f66b04ed000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0000000608f3c600> (a > java.util.concurrent.CompletableFuture$Signaller) > at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) > at > java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > at org.apache.flink.runtime.io > .network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231) > at org.apache.flink.runtime.io > .network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209) > at org.apache.flink.runtime.io > .network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189) > at org.apache.flink.runtime.io > .network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103) > at org.apache.flink.runtime.io > .network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145) > at org.apache.flink.runtime.io > .network.api.writer.RecordWriter.emit(RecordWriter.java:116) > at org.apache.flink.runtime.io > .network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60) > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107) > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89) > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45) > > > > > 有怀疑过是虚拟化网络问题,增加了如下参数,不见效: > taskmanager.network.request-backoff.max: 300000 > akka.ask.timeout: 120s > akka.watch.heartbeat.interval: 10s > > > 尝试过调整buffer数量,不见效: > taskmanager.network.memory.floating-buffers-per-gate: 16 > taskmanager.network.memory.buffers-per-channel: 6 > > > > > 目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。 > > Looking forward to your reply and help. > > Best > > > > |
taskmanager.network.memory.min:2g max:3g。 如果网络内存不足,集群不是应该在启动的时候就报错吗。是否会在运行期间才出现卡住的现象? | | a511955993 | | 邮箱:[hidden email] | 签名由 网易邮箱大师 定制 在2020年05月08日 10:08,LakeShen 写道: Hi , 你可以看下你的内存配置情况,看看是不是内存配置太小,导致 networkd bufffers 不够。 具体文档参考: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html Best, LakeShen a511955993 <[hidden email]> 于2020年5月7日周四 下午9:54写道: > hi, all > > > 集群信息: > flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, > cni使用的是weave。 > > > 现象: > 作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。 > > > 通过jstack出来的堆栈信息片段如下: > > > "Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f > waiting on condition [0x00007f66b04ed000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0000000608f3c600> (a > java.util.concurrent.CompletableFuture$Signaller) > at > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) > at > java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) > at > java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > at org.apache.flink.runtime.io > .network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231) > at org.apache.flink.runtime.io > .network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209) > at org.apache.flink.runtime.io > .network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189) > at org.apache.flink.runtime.io > .network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103) > at org.apache.flink.runtime.io > .network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145) > at org.apache.flink.runtime.io > .network.api.writer.RecordWriter.emit(RecordWriter.java:116) > at org.apache.flink.runtime.io > .network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60) > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107) > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89) > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45) > > > > > 有怀疑过是虚拟化网络问题,增加了如下参数,不见效: > taskmanager.network.request-backoff.max: 300000 > akka.ask.timeout: 120s > akka.watch.heartbeat.interval: 10s > > > 尝试过调整buffer数量,不见效: > taskmanager.network.memory.floating-buffers-per-gate: 16 > taskmanager.network.memory.buffers-per-channel: 6 > > > > > 目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。 > > Looking forward to your reply and help. > > Best > > > > |
In reply to this post by LakeShen
兄弟 请问你的问题解决了吗?怎么解决的,谢谢
| | 邵红晓 | | 邮箱:[hidden email] | 签名由网易邮箱大师定制 在2020年5月8日 10:08,LakeShen<[hidden email]> 写道: Hi , 你可以看下你的内存配置情况,看看是不是内存配置太小,导致 networkd bufffers 不够。 具体文档参考: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html Best, LakeShen a511955993 <[hidden email]> 于2020年5月7日周四 下午9:54写道: hi, all 集群信息: flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, cni使用的是weave。 现象: 作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。 通过jstack出来的堆栈信息片段如下: "Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f waiting on condition [0x00007f66b04ed000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000608f3c600> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) at org.apache.flink.runtime.io .network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231) at org.apache.flink.runtime.io .network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209) at org.apache.flink.runtime.io .network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189) at org.apache.flink.runtime.io .network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103) at org.apache.flink.runtime.io .network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145) at org.apache.flink.runtime.io .network.api.writer.RecordWriter.emit(RecordWriter.java:116) at org.apache.flink.runtime.io .network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45) 有怀疑过是虚拟化网络问题,增加了如下参数,不见效: taskmanager.network.request-backoff.max: 300000 akka.ask.timeout: 120s akka.watch.heartbeat.interval: 10s 尝试过调整buffer数量,不见效: taskmanager.network.memory.floating-buffers-per-gate: 16 taskmanager.network.memory.buffers-per-channel: 6 目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。 Looking forward to your reply and help. Best |
Free forum by Nabble | Edit this page |