Re: How to debug Metaspace exception?

Chesnay Schepler Thu, 28 Apr 2022 07:17:15 -0700

I think what I meant was "either add it to /lib, or [if it is already in /lib but also bundled in the jar] add it to the parent-first patterns."


On 28/04/2022 15:56, Chesnay Schepler wrote:

Pretty sure, even though I seemingly documented it incorrectly :)


On 28/04/2022 15:49, John Smith wrote:

You sure?

 *

    /JDBC/: JDBC drivers leak references outside the user code
    classloader. To ensure that these classes are only loaded once
    you should either add the driver jars to Flink’s |lib/| folder,
    or add the driver classes to the list of parent-first loaded
    class via |classloader.parent-first-patterns-additional|
    
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

    It says either or

On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org>wrote:


    You're misinterpreting the docs.

    The parent/child-first classloading controls where Flink looks
    for a class /first/, specifically whether we first load from /lib
    or the user-jar.
    It does not allow you to load something from the user-jar in the
    parent classloader. That's just not how it works.

    It must be in /lib.
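
The lookup order described above can be sketched as a small model. This is a simplified illustration, not Flink's actual classloader code: the class name `ResolutionOrderSketch`, the method `definingLoader`, and the use of plain sets to stand in for /lib and the user-jar are all assumptions made for the example. It shows why setting only the parent-first pattern does not move a driver out of the user-code classloader.

```java
import java.util.List;
import java.util.Set;

/** Simplified model of class resolution order; not Flink's real implementation. */
public class ResolutionOrderSketch {

    /** Returns which classloader would end up defining the class, or "not found". */
    static String definingLoader(String className,
                                 Set<String> classesInLib,
                                 Set<String> classesInUserJar,
                                 List<String> parentFirstPatterns) {
        // A pattern matches when the class name starts with it.
        boolean parentFirst = parentFirstPatterns.stream().anyMatch(className::startsWith);
        boolean inLib = classesInLib.contains(className);
        boolean inUserJar = classesInUserJar.contains(className);

        if (parentFirst) {
            // The parent (/lib) is asked first; the user-jar is only a fallback.
            if (inLib) return "parent (lib)";
            if (inUserJar) return "user-code classloader";
        } else {
            // Child-first: the user-jar copy wins when both contain the class.
            if (inUserJar) return "user-code classloader";
            if (inLib) return "parent (lib)";
        }
        return "not found";
    }

    public static void main(String[] args) {
        // Driver class name used for illustration only.
        String driver = "org.apache.ignite.IgniteJdbcThinDriver";
        List<String> patterns = List.of("org.apache.ignite.");

        // Pattern set, driver only in the user-jar: still the user-code classloader.
        System.out.println(definingLoader(driver, Set.of(), Set.of(driver), patterns));
        // Driver in both places, pattern set: the copy in /lib wins.
        System.out.println(definingLoader(driver, Set.of(driver), Set.of(driver), patterns));
        // Driver in both places, no pattern: child-first picks the user-jar copy.
        System.out.println(definingLoader(driver, Set.of(driver), Set.of(driver), List.of()));
    }
}
```

Only the second case gets the driver loaded once by the parent, which matches the advice to put the jar in /lib.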

    On 27/04/2022 04:59, John Smith wrote:

    Hi Chesnay as per the docs...
    
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

    You can either put the jars in task manager lib folder or use
    |classloader.parent-first-patterns-additional|
    
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

    I prefer the latter: that way the dependency stays with the
    user-jar and not on the task manager.

    On Tue, Apr 26, 2022 at 9:52 PM John Smith
    <java.dev....@gmail.com> wrote:

        Ok so I should put the Apache ignite and my Microsoft
        drivers in the lib folders of my task managers?

        And then in my job jar only include them as compile time
        dependencies?


        On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
        <ches...@apache.org> wrote:

            JDBC drivers are well-known for leaking classloaders
            unfortunately.

            You have correctly identified your alternatives.

            You must put the jdbc driver into /lib instead. Setting
            only the parent-first pattern shouldn't affect anything.
            That is only relevant if something is in both /lib
            and the user-jar, telling Flink to prioritize what is in
            lib.



            On 26/04/2022 15:35, John Smith wrote:

            So I put classloader.parent-first-patterns.additional:
            "org.apache.ignite." in the task config and so far I
            don't think I'm getting "java.lang.OutOfMemoryError:
            Metaspace" any more.

            Or it's too early to tell.

            Though now, the task managers are shutting down due to
            some other failures.

            So maybe because tasks were failing and reloading often
            the task manager was running out of Metaspace. But now
            maybe it's just cleanly shutting down.

            On Wed, Apr 20, 2022 at 11:35 AM John Smith
            <java.dev....@gmail.com> wrote:

                Or I can put in the config to treat
                org.apache.ignite. classes as first class?

                On Tue, Apr 19, 2022 at 10:18 PM John Smith
                <java.dev....@gmail.com> wrote:

                    Ok, so I loaded the dump into Eclipse Mat and
                    followed:
                    
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                    - On the Histogram, I got over 30 entries for:
                    ChildFirstClassLoader
                    - Then I clicked on one of them "Merge Shortest
                    Path..." and picked "Exclude all
                    phantom/weak/soft references"
                    - Which then gave me: SqlDriverManager > Apache
                    Ignite JdbcThin Driver

                    So I'm guessing this applies to anything JDBC-based.
                    Should I copy the drivers into the task managers' lib
                    folder and have my jobs declare them as compile-only
                    dependencies?
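
One way to cross-check a finding like this without a heap dump is to print which classloader defined each registered JDBC driver. The sketch below is illustrative only: `DriverLoaderReport` and its methods are hypothetical names, and it assumes the code runs inside the job, since `DriverManager.getDrivers()` only returns drivers visible to the caller's classloader. A loader whose class name ends in `ChildFirstClassLoader` would suggest the driver came from the user-jar rather than from /lib.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: reports which classloader defined each visible JDBC driver. */
public class DriverLoaderReport {

    /** Describes the classloader that defined the given class. */
    static String describeLoader(Class<?> type) {
        ClassLoader loader = type.getClassLoader();
        // Classes defined by the bootstrap loader report a null classloader.
        return loader == null ? "bootstrap" : loader.getClass().getName();
    }

    /** One line per driver visible to the caller: driver class -> defining loader. */
    static List<String> report() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            lines.add(driver.getClass().getName() + " -> " + describeLoader(driver.getClass()));
        }
        return lines;
    }

    public static void main(String[] args) {
        report().forEach(System.out::println);
    }
}
```

Run standalone with no driver on the classpath, the report may simply be empty; it is meant to be called from job code or a debugger session.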

                    On Tue, Apr 19, 2022 at 12:18 PM Yaroslav
                    Tkachenko <yaros...@goldsky.io> wrote:

                        Also
                        https://shopify.engineering/optimizing-apache-flink-applications-tips
                        might be helpful (has a section on
                        profiling, as well as classloading).

                        On Tue, Apr 19, 2022 at 4:35 AM Chesnay
                        Schepler <ches...@apache.org> wrote:

                            We have a very rough "guide" in the
                            wiki (it's just the specific steps I
                            took to debug another leak):
                            
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                            On 19/04/2022 12:01, huweihua wrote:

                            Hi, John

                            Sorry for the late reply. You can use
                            MAT[1] to analyze the dump file. Check
                            whether there are too many loaded classes.

                            [1] https://www.eclipse.org/mat/
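
Before opening a dump, a rough first signal can be read from the JVM's standard management beans: a loaded-class count that keeps growing across job restarts while the unloaded count stays flat hints at a leak. The sketch below is a minimal example under assumptions: `MetaspaceProbe` is a hypothetical name, the memory pool is assumed to be called "Metaspace" (as on HotSpot), and the numbers describe the JVM the code runs in, so it would need to run inside the task manager or be read over JMX.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

/** Hypothetical probe: class loading counters plus Metaspace usage of the current JVM. */
public class MetaspaceProbe {

    /** Bytes used by the pool named "Metaspace", or -1 if no such pool exists. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed();
            }
        }
        return -1L;
    }

    /** Single-line summary suitable for periodic logging. */
    static String summary() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format("loaded=%d totalLoaded=%d unloaded=%d metaspaceUsedBytes=%d",
                classes.getLoadedClassCount(),
                classes.getTotalLoadedClassCount(),
                classes.getUnloadedClassCount(),
                metaspaceUsedBytes());
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```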

                            On Apr 18, 2022, at 9:55 PM, John Smith
                            <java.dev....@gmail.com> wrote:

                            Hi, can anyone help with this? I
                            never looked at a dump file before.

                            On Thu, Apr 14, 2022 at 11:59 AM John
                            Smith <java.dev....@gmail.com> wrote:

                                Hi, so I have a dump file. What
                                do I look for?

                                On Thu, Mar 31, 2022 at 3:28 PM
                                John Smith
                                <java.dev....@gmail.com> wrote:

                                    Ok so if there's a leak, if I
                                    manually stop the job and
                                    restart it from the UI
                                    multiple times, I won't see
                                    the issue because the
                                    classes are unloaded correctly?


                                    On Thu, Mar 31, 2022 at 9:20
                                    AM huweihua
                                    <huweihua....@gmail.com> wrote:


                                        The difference is that
                                        manually canceling the
                                        job stops the JobMaster,
                                        but automatic failover
                                        keeps the JobMaster
                                        running. From the
                                        TaskManager's side, though,
                                        it doesn't make much difference.

                                        On Mar 31, 2022, at 4:01 AM,
                                        John Smith
                                        <java.dev....@gmail.com>
                                        wrote:

                                        Also if I manually
                                        cancel and restart the
                                        same job over and over
                                        is it the same as if
                                        flink was restarting a
                                        job due to failure?

                                        I.e: When I click
                                        "Cancel Job" on the UI
                                        is the job completely
                                        unloaded vs when the job
                                        scheduler restarts a job
                                        for whatever reason?

                                        Like this I'll stop and
                                        restart the job a few
                                        times or maybe I can
                                        trick my job to fail and
                                        have the scheduler
                                        restart it. Ok let me
                                        think about this...

                                        On Wed, Mar 30, 2022 at
                                        10:24 AM 胡伟华
                                        <huweihua....@gmail.com>
                                        wrote:

                                            So if I run the
                                            same jobs in my dev
                                            env will I still be
                                            able to see the
                                            similar dump?

                                            I think running the
                                            same job in dev
                                            should be
                                            reproducible, maybe
                                            you can have a try.

                                             If not I would
                                            have to wait at a
                                            low volume time to
                                            do it on
                                            production. Also, if
                                            I recall, the dump
                                            is as big as the
                                            JVM memory, right? So
                                            if I have 10GB
                                            configured for the
                                            JVM, the dump will
                                            be a 10GB file?

                                            Yes, jmap will pause
                                            the JVM; the length of
                                            the pause depends on the
                                            size of the dump. You
                                            can use "jmap
                                            -dump:live" to dump
                                            only the reachable
                                            objects, which should
                                            keep the pause brief.
                                            On Mar 30, 2022, at
                                            9:47 PM, John Smith
                                            <java.dev....@gmail.com>
                                            wrote:

                                            I have 3 task
                                            managers (see
                                            config below).
                                            There is a total of
                                            10 jobs with 25
                                            slots being used.
                                            The jobs are 100%
                                            ETL, i.e. they load
                                            JSON, transform it
                                            and push it to
                                            JDBC; only 1 job of
                                            the 10 is pushing
                                            to an Apache Ignite
                                            cluster.

                                            For jmap: I know
                                            that it will pause
                                            the task manager.
                                            So if I run the
                                            same jobs in my dev
                                            env, will I still be
                                            able to see a
                                            similar dump? I
                                            assume so. If not, I
                                            would have to wait
                                            for a low-volume
                                            time to do it on
                                            production. Also, if
                                            I recall, the dump
                                            is as big as the
                                            JVM memory, right? So
                                            if I have 10GB
                                            configured for the
                                            JVM, the dump will
                                            be a 10GB file?


                                            # Operating system has 16GB total.
                                            env.ssh.opts: -l flink -oStrictHostKeyChecking=no

                                            cluster.evenly-spread-out-slots: true

                                            taskmanager.memory.flink.size: 10240m
                                            taskmanager.memory.jvm-metaspace.size: 2048m
                                            taskmanager.numberOfTaskSlots: 16
                                            parallelism.default: 1

                                            high-availability: zookeeper
                                            high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
                                            high-availability.zookeeper.quorum: ...
                                            high-availability.zookeeper.path.root: /flink_1_14
                                            high-availability.cluster-id: /flink_1_14_cluster_0001

                                            web.upload.dir: /mnt/flink/uploads/flink_1_14

                                            state.backend: rocksdb
                                            state.backend.incremental: true
                                            state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
                                            state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14

                                            On Wed, Mar 30,
                                            2022 at 2:16 AM 胡伟华
                                            <huweihua....@gmail.com>
                                            wrote:

                                                Hi, John

                                                Could you tell
                                                us your
                                                application
                                                scenario? Is it
                                                a flink session
                                                cluster with a
                                                lot of jobs?

                                                Maybe you can
                                                try to dump the
                                                memory with
                                                jmap and use
                                                tools such as
                                                MAT to analyze
                                                whether there
                                                are abnormal
                                                classes and
                                                classloaders


                                                > On Mar 30, 2022, at
                                                6:09 AM, John
                                                Smith
                                                <java.dev....@gmail.com>
                                                wrote:
                                                >
                                                > Hi running 1.14.4
                                                >
                                                > My task
                                                manager still
                                                fails with
                                                java.lang.OutOfMemoryError:
                                                Metaspace. The
                                                metaspace
                                                out-of-memory
                                                error has
                                                occurred. This
                                                can mean two
                                                things: either
                                                the job
                                                requires a
                                                larger size of
                                                JVM metaspace
                                                to load classes
                                                or there is a
                                                class loading leak.
                                                >
                                                > I have 2GB of
                                                metaspace
                                                configured:
                                                taskmanager.memory.jvm-metaspace.size: 2048m
                                                >
                                                > But the task
                                                nodes still fail.
                                                >
                                                > When looking
                                                at the UI
                                                metrics, the
                                                metaspace
                                                starts low. Now
                                                I see 85%
                                                usage. It seems
                                                to be a class
                                                loading leak at
                                                this point, how
                                                can we debug
                                                this issue?
