The exclusions should not have any impact on that, because what defines which classloader will load which class is not the presence of a particular class in a specific jar, but the configuration of parent-first-patterns [1].
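
To make the point about patterns concrete, here is a minimal sketch of the kind of prefix check a child-first classloader can use to decide whether to delegate to the parent. The class and method names are hypothetical, and the pattern list is only an illustrative subset; the effective list is whatever the cluster configuration defines.

```java
import java.util.List;

// Hypothetical helper, not Flink's implementation: it only illustrates
// that the decision depends on the class NAME, not on which jar ships it.
class ParentFirstCheck {

    // Illustrative subset of prefixes; the real list comes from the
    // parent-first-patterns configuration of the cluster.
    static final List<String> PARENT_FIRST_PATTERNS =
            List.of("java.", "org.apache.flink.", "org.slf4j", "org.apache.log4j");

    // True if the class name starts with any of the configured prefixes,
    // meaning the parent classloader would be asked first.
    static boolean isParentFirst(String className, List<String> patterns) {
        for (String pattern : patterns) {
            if (className.startsWith(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper",
            "com.fasterxml.jackson.databind.ObjectMapper",
            "org.slf4j.Logger",
            "com.example.MyJob" // hypothetical user class
        };
        for (String name : names) {
            System.out.println(name + " parent-first: "
                    + isParentFirst(name, PARENT_FIRST_PATTERNS));
        }
    }
}
```

Under these assumed patterns the shaded Jackson class and the slf4j class are resolved by the parent even if the fat jar contains them, while the unshaded Jackson class and the user class are not.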

If you don't use any Flink-internal imports, it might still be the case that a class in one of the packages matched by the parent-first patterns holds a reference to your user-code classes, which would cause the leak. I'd recommend inspecting a heap dump after several restarts of the application and looking for references to Class objects from the root set.

Jan

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading

On 11/16/20 5:34 PM, Flavio Pompermaier wrote:
I've tried to remove all possible imports of classes not contained in the fat jar, but I still face the same problem. I've also tried to reduce the excludes in the shade section of the Maven plugin as much as possible (I took the one at [1]), so now I exclude only a few dependencies. Could it be that I should include org.slf4j:* if I use static imports of it?

<artifactSet>
    <excludes>
      <exclude>com.google.code.findbugs:jsr305</exclude>
      <exclude>org.slf4j:*</exclude>
      <exclude>log4j:*</exclude>
    </excludes>
</artifactSet>

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies

On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:

    Yes, that could definitely cause this. You should probably avoid
    using these flink-internal shaded classes and ship your own
    versions (not shaded).

    Best,

     Jan

    On 11/16/20 3:22 PM, Flavio Pompermaier wrote:
    Thank you Jan for your valuable feedback.
    Could it be that I should not import shaded Jackson classes
    in my user code? For example:
    import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper

    Best,
    Flavio

    On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:

        Hi Flavio,

        when I encountered a problem quite similar to the one you
        describe, it was related to static storage located in a class
        that was loaded "parent-first". In my case it was in
        java.lang.ClassValue, but it might be (and probably will be)
        different in your case. The problem is that if user code
        registers something in some (static) storage located in a
        class loaded by the parent (TaskManager) classloader, then its
        associated classes will never be GC'd and Metaspace will
        grow. A good starting point would be not to focus on the
        biggest consumers of heap (in general), but to look at where
        the 15k objects of type Class are referenced from. That might
        help you figure this out. I'm not sure if there is something
        that can be done in general to prevent this type of leak.
        That would probably be a question for the dev@ mailing list.
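
The reference chain described above can be sketched in plain Java. This is only an illustration under stated assumptions: `StaticRegistryLeak`, `REGISTRY`, `newUserObject` and `pins` are hypothetical names, the map stands in for static storage in a parent-first class, and a dynamic proxy is used merely as a convenient way to get a class defined by a separate "user" classloader.

```java
import java.lang.reflect.Proxy;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StaticRegistryLeak {

    // Stands in for static storage inside a class loaded parent-first.
    // It outlives every job, so anything stored here stays reachable.
    static final Map<String, Object> REGISTRY = new ConcurrentHashMap<>();

    // Creates an object whose class is defined by the given loader.
    // A dynamic proxy avoids having to ship bytecode for the demo.
    static Object newUserObject(ClassLoader userLoader) {
        return Proxy.newProxyInstance(
                userLoader,
                new Class<?>[] {Runnable.class},
                (proxy, method, methodArgs) -> null);
    }

    // True if some registry entry is an instance of a class defined by
    // 'loader'. Such an entry keeps the loader, and every class it has
    // loaded, strongly reachable: instance -> Class -> ClassLoader.
    static boolean pins(ClassLoader loader) {
        return REGISTRY.values().stream()
                .anyMatch(value -> value.getClass().getClassLoader() == loader);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = StaticRegistryLeak.class.getClassLoader();
        try (URLClassLoader userLoader = new URLClassLoader(new URL[0], parent)) {
            REGISTRY.put("job-1", newUserObject(userLoader));
            System.out.println("pinned after register: " + pins(userLoader));
            REGISTRY.clear(); // the cleanup a leaking job never performs
            System.out.println("pinned after clear: " + pins(userLoader));
        }
    }
}
```

In a heap dump the same situation shows up as a path from a GC root through the static field to an instance, then to its Class, then to the user-code classloader.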

        Best,

         Jan

        On 11/16/20 2:27 PM, Flavio Pompermaier wrote:
        Hello everybody,
        I was writing this email when a similar thread appeared on
        this mailing list.
        The difference is that the other problem seems to be related
        to Flink 1.10 on YARN and does not output anything helpful
        for debugging the cause of the problem.

        In my use case I use Flink 1.11.0 on a standalone session
        cluster (the job is submitted to the cluster using the CLI
        client).
        The problem arises when I submit the same job about 20 times
        (this number is unfortunately not deterministic and can vary
        a little). The error reported by the Task Executor is related
        to the ever-growing Metaspace. The error message seems to be
        pretty detailed [1].
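
One way to watch this growth between submissions, besides jmap, is through the standard JMX beans. The following is a minimal sketch; `MetaspaceProbe` is a hypothetical name, the pool name "Metaspace" is what HotSpot JVMs report and may be absent elsewhere, and the probe only measures the JVM it runs in, so it would have to run inside the TaskManager process or be read remotely over JMX.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class MetaspaceProbe {

    // Bytes currently used in the pool named "Metaspace", or -1 if this
    // JVM does not expose such a pool.
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    // One-line snapshot. If 'loaded' keeps climbing across job
    // submissions while 'unloaded' stays flat, classes are not being
    // released.
    static String snapshot() {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        return String.format(
                "loaded=%d unloaded=%d metaspaceUsed=%d",
                classLoading.getLoadedClassCount(),
                classLoading.getUnloadedClassCount(),
                metaspaceUsedBytes());
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```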

        I found the same issue in some previous threads on this
        mailing list and I've tried to figure out the cause of the
        problem. The issue is that, looking at the objects allocated,
        I don't really get an idea of the source of the problem,
        because the objects that consume the most memory are of
        general-purpose types (i.e. byte arrays, int arrays and
        Strings). These are my "top" memory consumers according to
        the output of jmap -histo <PID>:

        At run 0:

         num     #instances         #bytes  class name (module)
        -------------------------------------------------------
           1:         46238       13224056  [B (java.base@11.0.9.1)
           2:          3736        6536672  [I (java.base@11.0.9.1)
           3:         38081         913944  java.lang.String (java.base@11.0.9.1)
           4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
           5:          7146         844984  java.lang.Class (java.base@11.0.9.1)

        At run 1:

           1:         77608       25317496  [B (java.base@11.0.9.1)
           2:          7004        9088360  [I (java.base@11.0.9.1)
           3:         15814        1887256  java.lang.Class (java.base@11.0.9.1)
           4:         67381        1617144  java.lang.String (java.base@11.0.9.1)
           5:          3906        1422960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

        At run 6:

           1:         81408       25375400  [B (java.base@11.0.9.1)
           2:         12479        7249392  [I (java.base@11.0.9.1)
           3:         29090        3496168  java.lang.Class (java.base@11.0.9.1)
           4:          4347        2813416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
           5:         71584        1718016  java.lang.String (java.base@11.0.9.1)

        At run 8:

           1:        985979      127193256  [B (java.base@11.0.9.1)
           2:         35400       13702112  [I (java.base@11.0.9.1)
           3:        260387        6249288  java.lang.String (java.base@11.0.9.1)
           4:        148836        5953440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
           5:         17641        5222344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

        Thanks in advance for any help,
        Flavio

        [1]
        
--------------------------------------------------------------------------------------------------
        java.lang.OutOfMemoryError: Metaspace. The metaspace
        out-of-memory error has occurred. This can mean two things:
        either the job requires a larger size of JVM metaspace to
        load classes or there is a class loading leak. In the first
        case 'taskmanager.memory.jvm-metaspace.size' configuration
        option should be increased. If the error persists (usually
        in cluster after several job (re-)submissions) then there is
        probably a class loading leak in user code or some of its
        dependencies which has to be investigated and fixed. The
        task executor has to be shutdown...
                at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
                at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
                at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
                at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
                at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
                at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
                at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
                at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
                at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
                at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
                at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
