Hi,

The questions that @matth...@ververica.com <matth...@ververica.com> asked are very valid and might provide more leads. But if you haven't already, it's worth trying jemalloc / tcmalloc. We had similar problems with slow growth in TM memory that resulted in pods getting OOM-killed by k8s. After switching to jemalloc, the memory footprint improved dramatically.
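Since you are on YARN rather than Kubernetes, one way to try jemalloc is to preload it into the TaskManager containers through an environment variable. This is only a minimal sketch: it assumes libjemalloc is already installed on every NodeManager host, and the path used below (/usr/lib64/libjemalloc.so) is a placeholder you would need to adjust for your distribution. Something along these lines in flink-conf.yaml:

    # Forward LD_PRELOAD to every TaskManager container started on YARN
    # (the containerized.taskmanager.env.* prefix passes env vars to the containers).
    # Assumption: libjemalloc is installed at this path on all NodeManager hosts.
    containerized.taskmanager.env.LD_PRELOAD: /usr/lib64/libjemalloc.so

If the footprint still creeps up after switching the allocator, that points more towards native memory sizing (e.g. RocksDB) than towards allocator fragmentation.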
- Dhanesh Arole ( Sent from mobile device. Pardon me for typos )

On Thu, Apr 22, 2021 at 1:39 PM Matthias Pohl <matth...@ververica.com> wrote:

> Hi,
> I have a few questions about your case:
> * What is the option you're referring to for the bounded shuffle? That
>   might help to understand what streaming mode solution you're looking for.
> * What does the job graph look like? Are you assuming that it's due to a
>   shuffling operation? Could you provide the logs to get a better
>   understanding of your case?
> * Do you observe the same memory increase for other TaskManager nodes?
> * Are you expecting to reach the memory limits considering that you
>   mentioned a "big state size"? Would increasing the memory limit be an
>   option, or do you fear that it's caused by some memory leak?
>
> Best,
> Matthias
>
> On Fri, Apr 16, 2021 at 10:24 AM 马阳阳 <ma_yang_y...@163.com> wrote:
>
>> The Flink version we used is 1.12.0.
>>
>> 马阳阳
>> ma_yang_y...@163.com
>>
>> On 04/16/2021 16:07, 马阳阳 <ma_yang_y...@163.com> wrote:
>>
>> Hi, community,
>> When running a Flink streaming job with a big state size, one task manager
>> process was killed by the YARN node manager. The following log is from the
>> YARN node manager:
>>
>> 2021-04-16 11:51:23,013 WARN
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>> Container
>> [pid=521232,containerID=container_e157_1618223445363_16943_01_000010] is
>> running 19562496B beyond the 'PHYSICAL' memory limit. Current usage: 12.0
>> GB of 12 GB physical memory used; 15.2 GB of 25.2 GB virtual memory used.
>> Killing container.
>>
>> When searching for a solution to this problem, I found an option
>> that worked for bounded shuffle. So is there a way to get rid of
>> this in streaming mode?
>>
>> PS:
>> Memory-related options:
>> taskmanager.memory.process.size: 12288m
>> taskmanager.memory.managed.fraction: 0.7
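On the settings quoted at the bottom: with taskmanager.memory.process.size: 12288m and taskmanager.memory.managed.fraction: 0.7, roughly 7-8 GB of the 12 GB container is managed (mostly RocksDB) memory, which leaves comparatively little headroom for JVM overhead and other native allocations, and the container was killed for exceeding the limit by only ~19 MB. Independent of the allocator, a common mitigation is to give native allocations more headroom inside the same container budget. A minimal sketch, where the concrete numbers are only assumptions to illustrate the idea, not recommendations for your job:

    # Shift some of the budget away from managed (RocksDB) memory...
    taskmanager.memory.managed.fraction: 0.6
    # ...and raise the cap on JVM overhead, which is otherwise limited to a
    # fraction of the process size.
    taskmanager.memory.jvm-overhead.max: 2g

Alternatively, keeping the fractions as they are and simply raising taskmanager.memory.process.size (and with it the YARN container size) gives the same extra headroom if the cluster has capacity.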