Re: How to debug Metaspace exception?

Chesnay Schepler Mon, 02 May 2022 08:01:27 -0700

There are cases where user-code is run on the JobManager.
I'm not sure whether that applies to the JDBC sources, though.


On 02/05/2022 15:45, John Smith wrote:

Why do the JDBC jars need to be on the job manager node though?

On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler <ches...@apache.org> wrote:


    yes.
    But if you can ensure that the driver isn't bundled by any
    user-jar you can also skip the pattern configuration step.

    The pattern looks correct formatting-wise; you could try whether
    com.microsoft.sqlserver.jdbc. is enough to solve the issue.
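
A minimal sketch of how such a pattern string can be read, assuming each semicolon-separated entry is treated as a plain prefix of the fully qualified class name. The class ParentFirstPatterns and its methods are hypothetical helpers for illustration, not Flink code, and org.apache.ignitex.Foo is a made-up class name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParentFirstPatterns {

    /** Splits a semicolon-separated pattern string into trimmed, non-empty prefixes. */
    public static List<String> parse(String configValue) {
        return Arrays.stream(configValue.split(";"))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }

    /** Returns true if the class name starts with any of the given prefixes. */
    public static boolean matches(List<String> prefixes, String className) {
        return prefixes.stream().anyMatch(className::startsWith);
    }

    public static void main(String[] args) {
        List<String> prefixes =
                parse("org.apache.ignite.;com.microsoft.sqlserver.jdbc.");

        // Both drivers discussed in this thread fall under the prefixes.
        System.out.println(matches(prefixes, "com.microsoft.sqlserver.jdbc.SQLServerDriver")); // true
        System.out.println(matches(prefixes, "org.apache.ignite.IgniteJdbcThinDriver"));       // true

        // Unrelated classes are not affected.
        System.out.println(matches(prefixes, "org.apache.flink.api.common.functions.MapFunction")); // false

        // The trailing dot keeps a prefix from matching sibling packages.
        System.out.println(matches(prefixes, "org.apache.ignitex.Foo")); // false
    }
}
```

Under that assumption, com.microsoft.sqlserver.jdbc. covers every class in that package and its subpackages, which is why the narrower prefix may already be enough.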

    On 02/05/2022 14:41, John Smith wrote:

    Oh, so I should copy the jars to the lib folder and
    set classloader.parent-first-patterns.additional:
    "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the
    task managers and job managers?

    Also is my pattern correct?
    "org.apache.ignite.;com.microsoft.sqlserver.jdbc."

    Just to be sure I'm running a standalone cluster using zookeeper.
    So I have 3 zookeepers, 3 job managers and 3 task managers.


    On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler
    <ches...@apache.org> wrote:

        And you should make sure that it is set for both processes!

        On 02/05/2022 08:43, Chesnay Schepler wrote:

        The setting itself isn't taskmanager specific; it applies to
        both the job- and taskmanager process.

        On 02/05/2022 05:29, John Smith wrote:

        Also just to be sure this is a Task Manager setting right?

        On Thu, Apr 28, 2022 at 11:13 AM John Smith
        <java.dev....@gmail.com> wrote:

            I assume you will take action on your side to track and
            fix the doc? :)

            On Thu, Apr 28, 2022 at 11:12 AM John Smith
            <java.dev....@gmail.com> wrote:

                Ok so to summarize...

                - Build my job jar and have the JDBC driver as a
                compile only dependency and copy the JDBC driver to
                flink lib folder.

                Or

                - Build my job jar and include the JDBC driver in
                the shadow jar, plus copy the JDBC driver to the
                flink lib folder, plus make an entry in the config for
                |classloader.parent-first-patterns.additional|
                
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>


                On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
                <ches...@apache.org> wrote:

                    I think what I meant was "either add it to
                    /lib, or [if it is already in /lib but also
                    bundled in the jar] add it to the parent-first
                    patterns."

                    On 28/04/2022 15:56, Chesnay Schepler wrote:

                    Pretty sure, even though I seemingly
                    documented it incorrectly :)

                    On 28/04/2022 15:49, John Smith wrote:

                    You sure?

                     *

                        /JDBC/: JDBC drivers leak references
                        outside the user code classloader. To
                        ensure that these classes are only loaded
                        once you should either add the driver
                        jars to Flink’s |lib/| folder, or add the
                        driver classes to the list of
                        parent-first loaded class via
                        |classloader.parent-first-patterns-additional|
                        
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

                        It says either or


                    On Wed, Apr 27, 2022 at 3:44 AM Chesnay
                    Schepler <ches...@apache.org> wrote:

                        You're misinterpreting the docs.

                        The parent/child-first classloading
                        controls where Flink looks for a class
                        /first/, specifically whether we first
                        load from /lib or the user-jar.
                        It does not allow you to load something
                        from the user-jar in the parent
                        classloader. That's just not how it works.

                        It must be in /lib.
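
The lookup order described above can be illustrated with a toy model. This is a simplified sketch, not Flink's classloader: the class LookupOrderModel is hypothetical, and /lib and the user-jar are modelled as plain sets of class names. It only shows which location ends up providing a class under each setting:

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class LookupOrderModel {

    /** Returns "lib", "user-jar" or "not found" for the given class name. */
    public static String resolve(String className,
                                 Set<String> lib,
                                 Set<String> userJar,
                                 List<String> parentFirstPrefixes) {
        boolean parentFirst =
                parentFirstPrefixes.stream().anyMatch(className::startsWith);
        if (parentFirst) {
            // Parent-first: look in /lib first, fall back to the user-jar.
            if (lib.contains(className)) return "lib";
            if (userJar.contains(className)) return "user-jar";
        } else {
            // Child-first: look in the user-jar first, fall back to /lib.
            if (userJar.contains(className)) return "user-jar";
            if (lib.contains(className)) return "lib";
        }
        return "not found";
    }

    public static void main(String[] args) {
        String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
        Set<String> has = Collections.singleton(driver);
        Set<String> none = Collections.emptySet();
        List<String> pattern = Collections.singletonList("com.microsoft.sqlserver.jdbc.");
        List<String> noPattern = Collections.emptyList();

        // Driver in both places, no pattern: the user-jar copy wins.
        System.out.println(resolve(driver, has, has, noPattern)); // user-jar
        // Driver in both places, pattern set: the /lib copy wins.
        System.out.println(resolve(driver, has, has, pattern));   // lib
        // Driver only in the user-jar, pattern set: still the user-jar.
        System.out.println(resolve(driver, none, has, pattern));  // user-jar
        // Driver only in /lib: found there regardless of the pattern.
        System.out.println(resolve(driver, has, none, noPattern)); // lib
    }
}
```

The third case is the one at issue here: in this model the pattern changes the order of the lookup, but it cannot move a class that exists only in the user-jar into /lib.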

                        On 27/04/2022 04:59, John Smith wrote:

                        Hi Chesnay as per the docs...
                        
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

                        You can either put the jars in task
                        manager lib folder or use
                        |classloader.parent-first-patterns-additional|
                        
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

                        I prefer the latter; that way the
                        dependency stays with the user-jar and
                        not on the task manager.

                        On Tue, Apr 26, 2022 at 9:52 PM John
                        Smith <java.dev....@gmail.com> wrote:

                            Ok so I should put the Apache ignite
                            and my Microsoft drivers in the lib
                            folders of my task managers?

                            And then in my job jar only include
                            them as compile time dependencies?


                            On Tue, Apr 26, 2022 at 10:42 AM
                            Chesnay Schepler
                            <ches...@apache.org> wrote:

                                JDBC drivers are well-known for
                                leaking classloaders unfortunately.

                                You have correctly identified
                                your alternatives.

                                You must put the jdbc driver
                                into /lib instead. Setting only
                                the parent-first pattern
                                shouldn't affect anything.
                                That is only relevant if
                                something is both in /lib and
                                the user-jar, telling Flink to
                                prioritize what is in lib.



                                On 26/04/2022 15:35, John Smith
                                wrote:

                                So I put
                                classloader.parent-first-patterns.additional:
                                "org.apache.ignite." in the task config and
                                so far I don't think I'm getting
                                "java.lang.OutOfMemoryError: Metaspace" any more.

                                Or it's too early to tell.

                                Though now, the task managers
                                are shutting down due to some
                                other failures.

                                So maybe because tasks were
                                failing and reloading often, the
                                task manager was running out of
                                Metaspace. But now maybe it's
                                just cleanly shutting down.

                                On Wed, Apr 20, 2022 at 11:35
                                AM John Smith
                                <java.dev....@gmail.com> wrote:

                                    Or can I put in the config
                                    to treat org.apache.ignite.
                                    classes as parent-first?

                                    On Tue, Apr 19, 2022 at
                                    10:18 PM John Smith
                                    <java.dev....@gmail.com> wrote:

                                        Ok, so I loaded the
                                        dump into Eclipse MAT
                                        and followed:
                                        
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                                        - On the Histogram, I
                                        got over 30 entries
                                        for: ChildFirstClassLoader
                                        - Then I clicked on one
                                        of them "Merge Shortest
                                        Path..." and picked
                                        "Exclude all
                                        phantom/weak/soft
                                        references"
                                        - Which then gave me:
                                        SqlDriverManager >
                                        Apache Ignite JdbcThin
                                        Driver

                                        So I'm guessing this applies to
                                        anything JDBC-based. Should I copy
                                        the drivers into the task manager
                                        lib folder and make them
                                        compile-only dependencies in my jobs?

                                        On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko
                                        <yaros...@goldsky.io> wrote:

                                            Also
                                            
https://shopify.engineering/optimizing-apache-flink-applications-tips
                                            might be helpful
                                            (has a section on
                                            profiling, as well
                                            as classloading).

                                            On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler
                                            <ches...@apache.org> wrote:

                                                We have a very
                                                rough "guide"
                                                in the wiki
                                                (it's just the
                                                specific steps
                                                I took to debug
                                                another leak):
                                                
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

                                                On 19/04/2022 12:01, huweihua wrote:

                                                Hi, John

                                                Sorry for the late reply. You can use MAT[1] to
                                                analyze the dump file. Check whether there are
                                                too many loaded classes.

                                                [1] https://www.eclipse.org/mat/
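
Besides a dump, the number of loaded classes can also be read from the JVM's standard management beans. The following is a rough sketch, assuming it runs inside the JVM you want to inspect and that the JVM is HotSpot-based, where the memory pool is named "Metaspace"; the class MetaspaceStats is a hypothetical helper:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceStats {

    /** Number of classes currently loaded in this JVM. */
    public static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Used bytes of the "Metaspace" pool, or -1 if no pool with that name exists. */
    public static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed();
            }
        }
        return -1L; // assumption: non-HotSpot JVMs may name the pool differently
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        System.out.println("currently loaded classes: " + classLoading.getLoadedClassCount());
        System.out.println("total loaded classes:     " + classLoading.getTotalLoadedClassCount());
        System.out.println("unloaded classes:         " + classLoading.getUnloadedClassCount());
        System.out.println("metaspace used (bytes):   " + metaspaceUsedBytes());
    }
}
```

A loaded-class count that keeps growing across job restarts while the unloaded count stays flat would be consistent with classes not being released.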

                                                On Apr 18, 2022, at 9:55 PM, John Smith
                                                <java.dev....@gmail.com> wrote:

                                                Hi, can anyone help with this? I never looked
                                                at a dump file before.

                                                On Thu, Apr 14, 2022 at 11:59 AM John Smith
                                                <java.dev....@gmail.com> wrote:

                                                    Hi, so I have a dump file. What do I look for?

                                                    On Thu, Mar 31, 2022 at 3:28 PM John Smith
                                                    <java.dev....@gmail.com> wrote:

                                                        Ok so if there's a leak, if I manually stop the job
                                                        and restart it from the UI multiple times, I won't
                                                        see the issue because the classes are unloaded
                                                        correctly?



                                                        On Thu, Mar 31, 2022 at 9:20 AM huweihua
                                                        <huweihua....@gmail.com> wrote:


                                                            The difference is that manually canceling the job
                                                            stops the JobMaster, but automatic failover keeps
                                                            the JobMaster running. But looking at the
                                                            TaskManager, it doesn't make much difference.

                                                            On Mar 31, 2022, at 4:01 AM, John Smith
                                                            <java.dev....@gmail.com> wrote:

                                                            Also, if I manually cancel and restart the same job
                                                            over and over, is it the same as if flink was
                                                            restarting a job due to failure?

                                                            I.e.: when I click "Cancel Job" on the UI, is the job
                                                            completely unloaded, vs. when the job scheduler
                                                            restarts a job for whatever reason?

                                                            Like this I'll stop and restart the job a few times,
                                                            or maybe I can trick my job to fail and have the
                                                            scheduler restart it. Ok, let me think about this...

                                                            On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
                                                            <huweihua....@gmail.com> wrote:

                                                                I think the issue should be reproducible by running
                                                                the same job in dev; maybe you can give it a try.

                                                                If not, I would have to wait for a low-volume time to
                                                                do it on production. Also, if I recall, the dump is
                                                                as big as the JVM memory, right? So if I have 10GB
                                                                configured for the JVM, the dump will be a 10GB file?

                                                                Yes, jmap will pause the JVM; the length of the pause
                                                                depends on the size of the dump. You can use
                                                                "jmap -dump:live" to dump only the reachable objects;
                                                                this only causes a brief pause.
                                                                On Mar 30, 2022, at 9:47 PM, John Smith
                                                                <java.dev....@gmail.com> wrote:

                                                                I have 3 task managers (see config below). There is a total of 10 jobs
                                                                with 25 slots being used. The jobs are 100% ETL, i.e. they load JSON,
                                                                transform it and push it to JDBC. Only 1 job of the 10 is pushing to
                                                                an Apache Ignite cluster.

                                                                For jmap: I know that it will pause the task manager. So if I run the
                                                                same jobs in my dev env, will I still be able to see a similar dump?
                                                                I assume so. If not, I would have to wait for a low-volume time to do
                                                                it on production. Also, if I recall, the dump is as big as the JVM
                                                                memory, right? So if I have 10GB configured for the JVM, the dump
                                                                will be a 10GB file?


                                                                # Operating system has 16GB total.
                                                                env.ssh.opts: -l flink -oStrictHostKeyChecking=no

                                                                cluster.evenly-spread-out-slots: true

                                                                taskmanager.memory.flink.size: 10240m
                                                                taskmanager.memory.jvm-metaspace.size: 2048m
                                                                taskmanager.numberOfTaskSlots: 16
                                                                parallelism.default: 1

                                                                high-availability: zookeeper
                                                                high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
                                                                high-availability.zookeeper.quorum: ...
                                                                high-availability.zookeeper.path.root: /flink_1_14
                                                                high-availability.cluster-id: /flink_1_14_cluster_0001

                                                                web.upload.dir: /mnt/flink/uploads/flink_1_14

                                                                state.backend: rocksdb
                                                                state.backend.incremental: true
                                                                state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
                                                                state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
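
As a rough cross-check of this configuration against the 16GB host, the sketch below estimates the total task manager process size. It assumes the documented defaults of Flink's memory model that are not shown in the config above (JVM overhead fraction 0.1, minimum 192m, maximum 1g) and assumes the overhead is derived from the sum of Flink memory and metaspace. The class and method names are illustrative, not Flink API.

```java
public class FlinkProcessMemory {

    /**
     * Estimates total process memory in MB from Flink memory and metaspace.
     * The JVM overhead is taken as a fraction of the total process size,
     * clamped to [minMb, maxMb] (assumed defaults, see the text above).
     */
    static long totalProcessMb(long flinkMb, long metaspaceMb,
                               double overheadFraction, long minMb, long maxMb) {
        long base = flinkMb + metaspaceMb;
        // overhead = fraction * total  =>  overhead = base * fraction / (1 - fraction)
        long overhead = Math.round(base * overheadFraction / (1.0 - overheadFraction));
        overhead = Math.max(minMb, Math.min(maxMb, overhead));
        return base + overhead;
    }

    public static void main(String[] args) {
        // Values from the config: flink.size 10240m, jvm-metaspace.size 2048m.
        long total = totalProcessMb(10240, 2048, 0.1, 192, 1024);
        System.out.println("estimated process size: " + total + " MB");
    }
}
```

Under these assumptions the overhead hits the 1g cap, which gives 13312 MB (13GB) per task manager process and leaves roughly 3GB of the 16GB for the operating system and anything else on the node.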

                                                                On
                                                                Wed,
                                                                Mar
                                                                30,
                                                                2022
                                                                at
                                                                2:16
                                                                AM
                                                                胡伟华
                                                                
<huweihua....@gmail.com>
                                                                wrote:

                                                                    Hi, John

                                                                    Could you tell us your application scenario? Is it a Flink session
                                                                    cluster with a lot of jobs?

                                                                    Maybe you can try to dump the memory with jmap and use tools such
                                                                    as MAT to analyze whether there are abnormal classes and
                                                                    classloaders.


                                                                    > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
                                                                    >

                                                                    > Hi, running 1.14.4
                                                                    >

                                                                    > My task manager still fails with java.lang.OutOfMemoryError:
                                                                    > Metaspace. The metaspace out-of-memory error has occurred. This
                                                                    > can mean two things: either the job requires a larger size of
                                                                    > JVM metaspace to load classes or there is a class loading leak.
                                                                    >

                                                                    > I have 2GB of metaspace configured:
                                                                    > taskmanager.memory.jvm-metaspace.size: 2048m
                                                                    >

                                                                    > But the task nodes still fail.
                                                                    >

                                                                    > When looking at the UI metrics, the metaspace starts low. Now I
                                                                    > see 85% usage. It seems to be a class loading leak at this
                                                                    > point. How can we debug this issue?
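
One way to watch for the pattern described here (metaspace that starts low and keeps growing) is to read the JVM's own counters. The sketch below assumes a HotSpot JVM, where the memory pool is named "Metaspace", and uses only the standard `java.lang.management` beans. The class name `MetaspaceProbe` is illustrative. It reports on the JVM it runs in, so for a task manager the same beans would have to be read over JMX or from code running inside that process.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceProbe {

    /** Returns the usage of the "Metaspace" pool, or null if the JVM has no such pool. */
    static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MemoryUsage usage = metaspaceUsage();
        if (usage != null) {
            // getMax() is -1 when no limit (e.g. -XX:MaxMetaspaceSize) is set.
            System.out.println("metaspace used=" + usage.getUsed()
                    + " committed=" + usage.getCommitted()
                    + " max=" + usage.getMax());
        }
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        // Loaded count that keeps rising across job restarts while the
        // unloaded count stays flat is a hint that classloaders are retained.
        System.out.println("currently loaded=" + classes.getLoadedClassCount()
                + " total loaded=" + classes.getTotalLoadedClassCount()
                + " unloaded=" + classes.getUnloadedClassCount());
    }
}
```

Sampling these values before and after a job is cancelled and resubmitted gives a quick signal of whether classes are being unloaded, before reaching for a full heap dump.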
