Thanks for the responses. I did switch to per-job mode and it is working well, of course. I suspected there wouldn't be an easy solution, but I had to ask. Thanks!
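For anyone who finds this thread later: the switch just meant submitting each
job with the per-job deployment target instead of attaching to the session.
Roughly like this (the entry class and jar name are placeholders, not our
actual job):

    flink run -t yarn-per-job -c com.example.MyJob my-job.jar

With per-job mode the JM/TM JVMs are torn down together with the job, so the
leaked class loaders go with them.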
On Fri, Jan 7, 2022 at 3:37 AM David Morávek <david.mora...@gmail.com> wrote:

> Hi David,
>
> If I understand the problem correctly, there is really nothing we can do
> here. Soft references are garbage collected when there is high memory
> pressure and the garbage collector needs to free up more memory. The
> problem is that the GC doesn't take memory pressure on Metaspace into
> account here.
>
> I guess you might try to tweak _SoftRefLRUPolicyMSPerMB_ [1], but this
> might have other consequences, and the behavior can depend heavily on the
> garbage collector you're using.
>
> From the docs [1]:
>
> -XX:SoftRefLRUPolicyMSPerMB=*time*
>
> Sets the amount of time (in milliseconds) a softly reachable object is
> kept active on the heap after the last time it was referenced. The default
> value is one second of lifetime per free megabyte in the heap. The
> -XX:SoftRefLRUPolicyMSPerMB option accepts integer values representing
> milliseconds per one megabyte of the current heap size (for Java HotSpot
> Client VM) or the maximum possible heap size (for Java HotSpot Server VM).
> This difference means that the Client VM tends to flush soft references
> rather than grow the heap, whereas the Server VM tends to grow the heap
> rather than flush soft references. In the latter case, the value of the
> -Xmx option has a significant effect on how quickly soft references are
> garbage collected.
>
> The following example shows how to set the value to 2.5 seconds:
>
> -XX:SoftRefLRUPolicyMSPerMB=2500
>
> [1] https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html
>
> Best,
> D.
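(A note for anyone reading this in the archives: if you want to experiment
with that flag, it can be passed to the job manager's JVM through
flink-conf.yaml. A minimal sketch, assuming the standard env.java.opts keys;
I haven't verified that it actually prevents the OOM:

    env.java.opts.jobmanager: -XX:SoftRefLRUPolicyMSPerMB=0

A value of 0 tells the JVM to clear soft references as eagerly as possible,
which is the direction you'd want for this leak, at the cost of losing
whatever caching the soft references were providing.)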
> On Thu, Jan 6, 2022 at 3:13 AM Caizhi Weng <tsreape...@gmail.com> wrote:
>
>> Hi!
>>
>> As far as I remember this is a known issue from a few years ago, but
>> Flink currently has no solution for it (correct me if I'm wrong). I see
>> that you're running jobs in a YARN session. Could you switch to
>> yarn-per-job mode (where the JM and TMs are created and destroyed for
>> each job) as a workaround?
>>
>> On Tue, Jan 4, 2022 at 23:39, David Clutter <dclut...@yahooinc.com> wrote:
>>
>>> I am seeing an issue with class loaders not being GCed and the
>>> Metaspace eventually hitting OOM. Here is my setup:
>>>
>>> - Flink 1.13.1 on EMR using JDK 8 in session mode
>>> - The job manager is a long-running YARN session
>>> - New jobs are submitted every 5m (and typically run for less than 5m)
>>>
>>> I find that after a few hours the job manager gets killed with a
>>> Metaspace OOM. I tried increasing the Metaspace for the job manager,
>>> but that only delays the OOM.
>>>
>>> I did some debugging using jcmd and noticed that the size of the loaded
>>> classes is always increasing. Next I did a heap dump and found that
>>> instances of org.apache.flink.util.ChildFirstClassLoader are present
>>> long after the jobs complete. Checking the GC roots, I found that there
>>> is a reference in java.io.ObjectStreamClass$Caches. Seems to be this
>>> JDK issue: https://bugs.openjdk.java.net/browse/JDK-8277072
>>>
>>> Curious if there are any workarounds for this situation?
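PS: since the thread ends on the workaround question: short of per-job mode,
the only other approach I've seen discussed for JDK-8277072 is to
reflectively flush the ObjectStreamClass caches after each job finishes. A
rough, unsupported sketch against JDK 8 internals; the Caches class and its
field names are JDK-private, so treat this as an experiment rather than a
fix (on JDK 9+ it would also need --add-opens, and newer JDKs changed the
cache design entirely):

    import java.lang.reflect.Field;
    import java.util.Map;

    /** Best-effort flush of java.io.ObjectStreamClass$Caches (JDK 8). */
    public final class ObjectStreamClassCacheFlusher {

        /**
         * Clears both internal caches so their SoftReference values no
         * longer pin job class loaders. Safe to call repeatedly; silently
         * does nothing if the JDK internals don't match.
         */
        public static void flushCaches() {
            try {
                Class<?> caches =
                        Class.forName("java.io.ObjectStreamClass$Caches");
                clearStaticMap(caches, "localDescs");
                clearStaticMap(caches, "reflectors");
            } catch (ReflectiveOperationException e) {
                // Different JDK internals on this runtime; skip quietly.
            }
        }

        private static void clearStaticMap(Class<?> holder, String name)
                throws ReflectiveOperationException {
            Field f = holder.getDeclaredField(name);
            f.setAccessible(true);
            Object value = f.get(null); // static field, no instance needed
            if (value instanceof Map) {
                ((Map<?, ?>) value).clear();
            }
        }
    }

You would have to call flushCaches() from something that runs in the job
manager after each job completes, and it only helps if nothing else is still
holding a reference to the job's class loader.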