Map反压了,要看下游什么操作卡住了,比如写es很慢
------------------ 原始邮件 ------------------ 发件人: "a511955993"<[hidden email]>; 发送时间: 2020年5月7日(星期四) 晚上9:51 收件人: "user-zh"<[hidden email]>; 主题: flink on kubernetes 作业卡主现象咨询 hi, all 集群信息: flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, cni使用的是weave。 现象: 作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。 通过jstack出来的堆栈信息片段如下: "Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f waiting on condition [0x00007f66b04ed000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000608f3c600> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209) at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189) at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:116) at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45) 有怀疑过是虚拟化网络问题,增加了如下参数,不见效: taskmanager.network.request-backoff.max: 300000 akka.ask.timeout: 120s akka.watch.heartbeat.interval: 10s 尝试过调整buffer数量,不见效: taskmanager.network.memory.floating-buffers-per-gate: 16 taskmanager.network.memory.buffers-per-channel: 6 目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。 Looking forward to your reply and help. Best |
这个map其实是union操作,然后下游是keyby agg ,数据发送到kafka,kafka层面没有问题。 | | a511955993 | | 邮箱:[hidden email] | 签名由 网易邮箱大师 定制 在2020年05月08日 10:06,蒋佳成(Jiacheng Jiang) 写道: Map反压了,要看下游什么操作卡住了,比如写es很慢 ------------------ 原始邮件 ------------------ 发件人: "a511955993"<[hidden email]>; 发送时间: 2020年5月7日(星期四) 晚上9:51 收件人: "user-zh"<[hidden email]>; 主题: flink on kubernetes 作业卡主现象咨询 hi, all 集群信息: flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, cni使用的是weave。 现象: 作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。 通过jstack出来的堆栈信息片段如下: "Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f waiting on condition [0x00007f66b04ed000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000608f3c600> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209) at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189) at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:116) at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45) 有怀疑过是虚拟化网络问题,增加了如下参数,不见效: taskmanager.network.request-backoff.max: 300000 akka.ask.timeout: 120s akka.watch.heartbeat.interval: 10s 尝试过调整buffer数量,不见效: taskmanager.network.memory.floating-buffers-per-gate: 16 taskmanager.network.memory.buffers-per-channel: 6 目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。 Looking forward to your reply and help. Best |
Free forum by Nabble | Edit this page |