I assume you will take action on your side to track and fix the doc? :)

On Thu, Apr 28, 2022 at 11:12 AM John Smith <java.dev....@gmail.com> wrote:
> Ok so to summarize...
>
> - Build my job jar and have the JDBC driver as a compile only
>   dependency and copy the JDBC driver to flink lib folder.
>
> Or
>
> - Build my job jar and include JDBC driver in the shadow, plus copy the
>   JDBC driver in the flink lib folder, plus make an entry in config for
>   classloader.parent-first-patterns-additional
>   <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>
> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org> wrote:
>
>> I think what I meant was "either add it to /lib, or [if it is already in
>> /lib but also bundled in the jar] add it to the parent-first patterns."
>>
>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>
>> Pretty sure, even though I seemingly documented it incorrectly :)
>>
>> On 28/04/2022 15:49, John Smith wrote:
>>
>> You sure?
>>
>> - *JDBC*: JDBC drivers leak references outside the user code
>>   classloader. To ensure that these classes are only loaded once you should
>>   either add the driver jars to Flink’s lib/ folder, or add the driver
>>   classes to the list of parent-first loaded class via
>>   classloader.parent-first-patterns-additional
>>   <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>
>> It says either or
>>
>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org> wrote:
>>
>>> You're misinterpreting the docs.
>>>
>>> The parent/child-first classloading controls where Flink looks for a
>>> class *first*, specifically whether we first load from /lib or the
>>> user-jar.
>>> It does not allow you to load something from the user-jar in the parent
>>> classloader. That's just not how it works.
>>>
>>> It must be in /lib.
>>>
>>> On 27/04/2022 04:59, John Smith wrote:
>>>
>>> Hi Chesnay as per the docs...
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>
>>> You can either put the jars in the task manager lib folder or use
>>> classloader.parent-first-patterns-additional
>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>
>>> I prefer the latter: that way the dependency stays with the user-jar
>>> and not on the task manager.
>>>
>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:
>>>
>>>> Ok, so I should put the Apache Ignite and my Microsoft drivers in the
>>>> lib folders of my task managers?
>>>>
>>>> And then in my job jar only include them as compile-time dependencies?
>>>>
>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>
>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>
>>>>> You have correctly identified your alternatives.
>>>>>
>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>> parent-first pattern shouldn't affect anything.
>>>>> That is only relevant if something is both in /lib and the
>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>
>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>
>>>>> So I put classloader.parent-first-patterns.additional:
>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>
>>>>> Or it's too early to tell.
>>>>>
>>>>> Though now, the task managers are shutting down due to some
>>>>> other failures.
>>>>>
>>>>> So maybe because tasks were failing and reloading often the task
>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>> cleanly shutting down.
>>>>>
>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>
>>>>>> Or I can put in the config to treat org.apache.ignite.
>>>>>> classes as first class?
>>>>>>
>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>
>>>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>>>>>   "Exclude all phantom/weak/soft references"
>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>   Driver
>>>>>>>
>>>>>>> So I'm guessing that anything JDBC based I should copy into the task
>>>>>>> manager libs folder, and in my jobs make the dependencies compile only?
>>>>>>>
>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <yaros...@goldsky.io> wrote:
>>>>>>>
>>>>>>>> Also
>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>>>>> steps I took to debug another leak):
>>>>>>>>>
>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>
>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>
>>>>>>>>> Hi, John
>>>>>>>>>
>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>>>>> file. Check whether there are too many loaded classes.
>>>>>>>>>
>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>
>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file
>>>>>>>>> before.
>>>>>>>>>
>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, so I have a dump file.
>>>>>>>>>> What do I look for?
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, so if there's a leak and I manually stop the job and restart
>>>>>>>>>>> it from the UI multiple times, I won't see the issue because the
>>>>>>>>>>> classes are unloaded correctly?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. But
>>>>>>>>>>>> from the TaskManager's side, it doesn't make much difference.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Also, if I manually cancel and restart the same job over and
>>>>>>>>>>>> over, is it the same as if Flink was restarting a job due to
>>>>>>>>>>>> failure?
>>>>>>>>>>>>
>>>>>>>>>>>> I.e.: when I click "Cancel Job" on the UI, is the job completely
>>>>>>>>>>>> unloaded, vs. when the job scheduler restarts a job for whatever
>>>>>>>>>>>> reason?
>>>>>>>>>>>>
>>>>>>>>>>>> Like this I'll stop and restart the job a few times, or maybe I
>>>>>>>>>>>> can trick my job to fail and have the scheduler restart it. Ok, let
>>>>>>>>>>>> me think about this...
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able
>>>>>>>>>>>>> to see the similar dump?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think running the same job in dev should be reproducible,
>>>>>>>>>>>>> maybe you can have a try.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If not I would have to wait at a low volume time to do it on
>>>>>>>>>>>>> production.
>>>>>>>>>>>>> Also, if I recall, the dump is as big as the JVM memory, right? So
>>>>>>>>>>>>> if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>>>>>>>> reachable objects; this will take a brief pause.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have 3 task managers (see config below). There is a total of
>>>>>>>>>>>>> 10 jobs with 25 slots being used.
>>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and
>>>>>>>>>>>>> push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite
>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I
>>>>>>>>>>>>> run the same jobs in my dev env, will I still be able to see a similar
>>>>>>>>>>>>> dump? I assume so. If not, I would have to wait for a low-volume
>>>>>>>>>>>>> time to do it on production. Also, if I recall, the dump is as big as
>>>>>>>>>>>>> the JVM memory, right? So if I have 10GB configured for the JVM, the
>>>>>>>>>>>>> dump will be a 10GB file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>
>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>
>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>
>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>
>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>>>>> classloaders.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi, running 1.14.4.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > My task manager still fails with
>>>>>>>>>>>>>> > java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>>>>>>>>>>>>>> > has occurred. This can mean two things: either the job requires a larger
>>>>>>>>>>>>>> > size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low.
>>>>>>>>>>>>>> > Now I see 85% usage. It seems to be a class loading leak at this
>>>>>>>>>>>>>> > point. How can we debug this issue?
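
For reference, here is a minimal sketch of the config entry for the second option summarized at the top of this thread (driver bundled in the job jar and also copied to /lib). The key name and the "org.apache.ignite." pattern are taken from the messages above; treating this as a complete setup is an assumption, not something verified here. As explained in the thread, the pattern only tells Flink to prefer the copy in /lib, so it should have no effect unless the driver jar is actually present there.

```yaml
# flink-conf.yaml (sketch, not a verified setup)
# Assumes the Ignite JDBC driver jar has ALSO been copied into Flink's lib/ folder.
# Classes whose names start with this prefix are then resolved parent-first,
# i.e. from lib/ rather than from the copy bundled in the user jar.
classloader.parent-first-patterns.additional: "org.apache.ignite."
```

With the first option (driver as a compile-only dependency, jar copied to /lib), this entry should not be needed, since the driver classes exist only in /lib.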