兄弟 请问你的问题解决了吗?怎么解决的,谢谢
| | 邵红晓 | | 邮箱:[email protected] | 签名由网易邮箱大师定制 在2020年5月8日 10:08,LakeShen<[email protected]> 写道: Hi , 你可以看下你的内存配置情况,看看是不是内存配置太小,导致 networkd bufffers 不够。 具体文档参考: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html Best, LakeShen a511955993 <[email protected]> 于2020年5月7日周四 下午9:54写道: hi, all 集群信息: flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, cni使用的是weave。 现象: 作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。 通过jstack出来的堆栈信息片段如下: "Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f waiting on condition [0x00007f66b04ed000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000608f3c600> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) at org.apache.flink.runtime.io .network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231) at org.apache.flink.runtime.io .network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209) at org.apache.flink.runtime.io .network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189) at org.apache.flink.runtime.io .network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103) at org.apache.flink.runtime.io .network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145) at org.apache.flink.runtime.io .network.api.writer.RecordWriter.emit(RecordWriter.java:116) at org.apache.flink.runtime.io .network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45) 有怀疑过是虚拟化网络问题,增加了如下参数,不见效: taskmanager.network.request-backoff.max: 300000 akka.ask.timeout: 120s akka.watch.heartbeat.interval: 10s 尝试过调整buffer数量,不见效: taskmanager.network.memory.floating-buffers-per-gate: 16 taskmanager.network.memory.buffers-per-channel: 6 目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。 Looking forward to your reply and help. Best
