Hi,

Are you sure that the growing memory comes from DirectByteBuffer? What about metaspace? Flink 1.9 may leak metaspace after a full restart or a fine-grained restart, see [1] and [2] for more details. And if you didn't cap it with -XX:MaxMetaspaceSize, metaspace can grow indefinitely and eventually cause an OOM kill.
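To tell the two apart before tuning anything, it can help to log the JVM's direct buffer pool and metaspace usage from inside a task manager. Below is a minimal sketch using only the standard java.lang.management API; the class name and output format are mine, not anything from your job or from Flink:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Minimal diagnostic sketch: prints direct/mapped buffer pool usage and
// metaspace usage, so you can see which of the two is actually growing.
public class NativeMemoryProbe {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            // "direct" covers DirectByteBuffer, "mapped" covers MappedByteBuffer.
            System.out.printf("buffer pool %-6s count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                // max is -1 when no -XX:MaxMetaspaceSize is set, i.e. unbounded growth.
                System.out.printf("metaspace used=%d bytes max=%d bytes%n",
                        pool.getUsage().getUsed(), pool.getUsage().getMax());
            }
        }
    }
}

If metaspace turns out to be the culprit, you could cap it for the task managers with something like env.java.opts.taskmanager: "-XX:MaxMetaspaceSize=256m" in flink-conf.yaml (256m is only an example value); a leak then fails fast with a metaspace OutOfMemoryError instead of the container being killed by YARN.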
[1]. https://issues.apache.org/jira/browse/FLINK-16225
[2]. https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code

Regards
Smile

On 2021/07/15 18:22:56, bat man <tintin0...@gmail.com> wrote:
> I am not using the Kafka SSL port.
>
> On Thu, Jul 15, 2021 at 9:48 PM Alexey Trenikhun <yen...@msn.com> wrote:
>
> > Just in case, make sure that you are not using the Kafka SSL port without
> > setting the security protocol, see [1].
> >
> > [1] https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-4090
> > ------------------------------
> > *From:* bat man <tintin0...@gmail.com>
> > *Sent:* Wednesday, July 14, 2021 10:55:54 AM
> > *To:* Timo Walther <twal...@apache.org>
> > *Cc:* user <user@flink.apache.org>
> > *Subject:* Re: High DirectByteBuffer Usage
> >
> > Hi Timo,
> >
> > I am looking at these options.
> > However, I had a couple of questions -
> > 1. The off-heap usage grows over time. My job does not do any off-heap
> > operations, so I don't think there is a leak there. Even after GC it keeps
> > adding a few MBs after hours of running.
> > 2. Secondly, I am seeing that as the incoming record volume increases, the
> > off-heap usage grows. What's the reason for this?
> >
> > I am using 1.9. Is there any known bug which is causing this issue?
> >
> > Thanks,
> > Hemant
> >
> > On Wed, Jul 14, 2021 at 7:30 PM Timo Walther <twal...@apache.org> wrote:
> >
> > Hi Hemant,
> >
> > did you check out the dedicated page for memory configuration and
> > troubleshooting:
> >
> > https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_trouble/#outofmemoryerror-direct-buffer-memory
> > https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_trouble/#container-memory-exceeded
> >
> > It is likely that the high number of output streams could cause your
> > issues.
> >
> > Regards,
> > Timo
> >
> > On 14.07.21 08:46, bat man wrote:
> > > Hi,
> > > I have a job which reads different streams from 5 Kafka topics. It
> > > filters the data, which is then streamed to different operators for
> > > processing. This step involves data shuffling.
> > >
> > > The data is also enriched in 4 join (KeyedCoProcessFunction)
> > > operators. After joining, the data is written to different Kafka topics.
> > > There are a total of 16 different output streams which are written to 4
> > > topics.
> > >
> > > I have been facing some issues with YARN killing containers. I took a
> > > heap dump and ran it through JXray [1]. Heap usage is not high. One
> > > thing which stands out is the off-heap usage, which is very high. My guess
> > > is this is what is killing the containers as the data inflow increases.
> > >
> > > Screenshot 2021-07-14 at 11.52.41 AM.png
> > >
> > > From the stack above, is this usage high because of the many output streams
> > > being written to Kafka topics? As the stack shows, RecordWriter is holding
> > > on to this DirectByteBuffer. I have assigned 1 GB of network memory, and
> > > -XX:MaxDirectMemorySize also shows ~1 GB for the task managers.
> > >
> > > From [2] I found that setting -Djdk.nio.maxCachedBufferSize=262144
> > > limits the temp buffer cache. Will it help in this case?
> > > The JVM version used is OpenJDK 64-Bit Server VM - Red Hat, Inc.,
> > > 1.8/25.282-b08.
> > >
> > > [1] https://jxray.com
> > > [2] https://dzone.com/articles/troubleshooting-problems-with-native-off-heap-memo
> > >
> > > Thanks,
> > > Hemant