Hi Chesnay, as per the docs: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in the task manager lib folder or use classloader.parent-first-patterns.additional <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

I prefer the latter: that way the dependency stays with the user-jar and not on the task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:

> Ok so I should put the Apache Ignite and my Microsoft drivers in the lib
> folders of my task managers?
>
> And then in my job jar only include them as compile-time dependencies?
>
>
> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>
>> You have correctly identified your alternatives.
>>
>> You must put the jdbc driver into /lib instead. Setting only the
>> parent-first pattern shouldn't affect anything.
>> That is only relevant if something is both in /lib and the user-jar,
>> telling Flink to prioritize what is in lib.
>>
>>
>>
>> On 26/04/2022 15:35, John Smith wrote:
>>
>> So I put classloader.parent-first-patterns.additional:
>> "org.apache.ignite." in the task config and so far I don't think I'm
>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>
>> Or it's too early to tell.
>>
>> Though now, the task managers are shutting down due to some
>> other failures.
>>
>> So maybe because tasks were failing and reloading often the task manager
>> was running out of Metaspace. But now maybe it's just cleanly shutting down.
>>
>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com>
>> wrote:
>>
>>> Or I can put in the config to treat org.apache.ignite. classes as first
>>> class?
>>>
>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>
>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>> "Exclude all phantom/weak/soft references"
>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>
>>>> So I'm guessing that for anything JDBC-based, I should copy it into the
>>>> task manager lib folder and make the dependencies compile-only in my jobs?
>>>>
>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>> yaros...@goldsky.io> wrote:
>>>>
>>>>> Also
>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>
>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>> steps I took to debug another leak):
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>
>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>
>>>>>> Hi, John
>>>>>>
>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>> file. Check whether there are too many loaded classes.
>>>>>>
>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>
>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>>>>
>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>
>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok so if there's a leak, if I manually stop the job and restart it
>>>>>>>> from the UI multiple times, I won't see the issue because the
>>>>>>>> classes are unloaded correctly?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. But
>>>>>>>>> from the TaskManager's point of view, it doesn't make much difference.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Also if I manually cancel and restart the same job over and over
>>>>>>>>> is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>
>>>>>>>>> I.e.: When I click "Cancel Job" on the UI is the job completely
>>>>>>>>> unloaded vs when the job scheduler restarts a job for whatever
>>>>>>>>> reason?
>>>>>>>>>
>>>>>>>>> Like this I'll stop and restart the job a few times or maybe I can
>>>>>>>>> trick my job into failing and have the scheduler restart it. Ok let me
>>>>>>>>> think about this...
>>>>>>>>>
>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>>> see the similar dump?
>>>>>>>>>>
>>>>>>>>>> I think running the same job in dev should be reproducible, maybe
>>>>>>>>>> you can have a try.
>>>>>>>>>>
>>>>>>>>>> If not I would have to wait at a low volume time to do it on
>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>>>>> right, so if I have 10GB configured for the JVM the dump will be a
>>>>>>>>>> 10GB file?
>>>>>>>>>>
>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>>>>> size of the dump.
>>>>>>>>>> You can use "jmap -dump:live" to dump only the reachable
>>>>>>>>>> objects; this will cause only a brief pause.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10
>>>>>>>>>> jobs with 25 slots being used.
>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push
>>>>>>>>>> it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>>>>
>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I run
>>>>>>>>>> the same jobs in my dev env will I still be able to see the similar
>>>>>>>>>> dump? I assume so. If not I would have to wait at a low volume time
>>>>>>>>>> to do it on production. Also, if I recall, the dump is as big as the
>>>>>>>>>> JVM memory, right, so if I have 10GB configured for the JVM the dump
>>>>>>>>>> will be a 10GB file?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>
>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>
>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>
>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>
>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>
>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, John
>>>>>>>>>>>
>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>
>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>> classloaders.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi running 1.14.4
>>>>>>>>>>> >
>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can
>>>>>>>>>>> > mean two things: either the job requires a larger size of JVM
>>>>>>>>>>> > metaspace to load classes or there is a class loading leak.
>>>>>>>>>>> >
>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>> >
>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>> >
>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now
>>>>>>>>>>> > I see 85% usage. It seems to be a class loading leak at this point,
>>>>>>>>>>> > how can we debug this issue?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>>
>>
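
For reference, here is a rough sketch of how the two options discussed at the top of this thread might look in flink-conf.yaml. The "org.apache.ignite." prefix is the one John already used; the "com.microsoft.sqlserver." prefix for the Microsoft driver is my assumption and should be checked against the driver actually in use. Keep in mind Chesnay's point in the quoted reply that the parent-first pattern is only relevant when the classes are present in both /lib and the user-jar.

```yaml
# Sketch only, not a verified setup.

# Option 1 (Chesnay's recommendation): copy the JDBC driver jars into the
# lib/ folder of every task manager and keep them out of the job jar.
# This option needs no classloader setting.

# Option 2: ask Flink to resolve matching classes parent-first.
# The value is a semicolon-separated list of class name prefixes.
# "com.microsoft.sqlserver." is an assumed prefix, see the note above.
classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver."
```

Whichever option is used, the Metaspace graph in the UI should stay flat across job restarts if the leak is gone.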