I've tried to remove all imports of classes not contained in the fat jar,
but I still face the same problem.
I've also tried to reduce the excludes in the shade section of the Maven
plugin as much as possible (I took the template at [1]), so now I exclude
only a few dependencies. Could it be that I should also include
org.slf4j:* if I reference it via a static import (see the usage sketch
after the snippet)?

<artifactSet>
    <excludes>
      <exclude>com.google.code.findbugs:jsr305</exclude>
      <exclude>org.slf4j:*</exclude>
      <exclude>log4j:*</exclude>
    </excludes>
</artifactSet>
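
For reference, the slf4j usage in the job code is nothing exotic; it is
roughly along these lines (just a sketch, MyJob is a placeholder name):

import static org.slf4j.LoggerFactory.getLogger;

import org.slf4j.Logger;

public class MyJob {

    // The job only holds the usual static logger reference; the slf4j
    // binding itself is expected to come from the Flink classpath at runtime.
    private static final Logger LOG = getLogger(MyJob.class);

    public static void main(String[] args) {
        LOG.info("Submitting job ...");
    }
}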

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies

On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:

> Yes, that could definitely cause this. You should probably avoid using
> these flink-internal shaded classes and ship your own versions (not shaded).
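>
> For example, something along these lines (just a sketch; MyJsonUtil is a
> made-up name, and it assumes the plain jackson-databind artifact is
> bundled in the job jar instead of the relocated one):
>
> import com.fasterxml.jackson.databind.ObjectMapper;
>
> public class MyJsonUtil {
>
>     // Unshaded ObjectMapper from the jackson-databind dependency declared
>     // in the job's own pom, instead of the Flink-relocated class.
>     private static final ObjectMapper MAPPER = new ObjectMapper();
>
>     public static String toJson(Object value) throws Exception {
>         return MAPPER.writeValueAsString(value);
>     }
> }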
>
> Best,
>
>  Jan
> On 11/16/20 3:22 PM, Flavio Pompermaier wrote:
>
> Thank you Jan for your valuable feedback.
> Could it be that I should not import the shaded Jackson classes in my user
> code?
> For example, import
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?
>
> Best,
> Flavio
>
> On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi Flavio,
>>
>> when I encountered a problem quite similar to the one you describe, it was
>> related to static storage located in a class that was loaded
>> "parent-first". In my case it was java.lang.ClassValue, but it might (and
>> probably will) be different in your case. The problem is that if user code
>> registers something in some (static) storage located in a class loaded by
>> the parent (TaskManager) classloader, then the associated classes will
>> never be GC'd and Metaspace will grow. A good starting point would be not
>> to focus on the biggest heap consumers in general, but to look at where
>> the 15k objects of type Class are referenced from. That might help you
>> figure this out. I'm not sure there is something that can be done in
>> general to prevent this type of leak. That would probably be a question
>> for the dev@ mailing list.
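>>
>> To illustrate the pattern with a purely hypothetical sketch (the class
>> name is made up): if something like this lives in a class loaded by the
>> parent classloader, every Class registered from user code keeps the
>> user-code classloader reachable:
>>
>> import java.util.Map;
>> import java.util.concurrent.ConcurrentHashMap;
>>
>> // Imagine this class sits on the TaskManager classpath, i.e. it is loaded
>> // by the parent classloader and its static state outlives every job.
>> public final class GlobalRegistry {
>>
>>     private static final Map<String, Class<?>> ENTRIES = new ConcurrentHashMap<>();
>>
>>     public static void register(String key, Class<?> clazz) {
>>         // If 'clazz' was loaded by the child-first user-code classloader,
>>         // this entry keeps that classloader and all of its classes
>>         // reachable, so their Metaspace is never reclaimed after the job
>>         // finishes.
>>         ENTRIES.put(key, clazz);
>>     }
>> }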
>>
>> Best,
>>
>>  Jan
>> On 11/16/20 2:27 PM, Flavio Pompermaier wrote:
>>
>> Hello everybody,
>> I was writing this email when a similar thread appeared on this mailing
>> list.
>> The difference is that the other problem seems to be related to Flink
>> 1.10 on YARN and does not output anything helpful for debugging the
>> cause of the problem.
>>
>> In my case I use Flink 1.11.0 on a standalone session cluster (the job is
>> submitted to the cluster using the CLI client).
>> The problem arises when I submit the same job about 20 times (this number
>> unfortunately is not deterministic and can vary a little). The error
>> reported by the Task Executor is related to the ever-growing Metaspace;
>> the error message seems to be pretty detailed [1].
>>
>> I found the same issue in some previous threads on this mailing list and
>> I've tried to figure out the cause of the problem. The trouble is that
>> looking at the allocated objects doesn't really give me an idea of the
>> source of the problem, because the types of objects consuming the memory
>> are general-purpose ones (i.e. byte arrays, int arrays, and Strings).
>> These are my "top" memory consumers according to the output of
>> jmap -histo <PID>:
>>
>> At run 0:
>>
>>  num     #instances         #bytes  class name (module)
>> -------------------------------------------------------
>>    1:         46238       13224056  [B (java.base@11.0.9.1)
>>    2:          3736        6536672  [I (java.base@11.0.9.1)
>>    3:         38081         913944  java.lang.String (java.base@11.0.9.1)
>>    4:            26         852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
>>    5:          7146         844984  java.lang.Class (java.base@11.0.9.1)
>>
>> At run 1:
>>
>>    1:         77.608       25.317.496  [B (java.base@11.0.9.1)
>>    2:          7.004        9.088.360  [I (java.base@11.0.9.1)
>>    3:         15.814        1.887.256  java.lang.Class (java.base@11.0.9.1)
>>    4:         67.381        1.617.144  java.lang.String (java.base@11.0.9.1)
>>    5:          3.906        1.422.960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>
>> At run 6:
>>
>>    1:         81.408       25.375.400  [B (java.base@11.0.9.1)
>>    2:         12.479        7.249.392  [I (java.base@11.0.9.1)
>>    3:         29.090        3.496.168  java.lang.Class (java.base@11.0.9.1)
>>    4:          4.347        2.813.416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>    5:         71.584        1.718.016  java.lang.String (java.base@11.0.9.1)
>>
>> At run 8:
>>
>>    1:        985.979      127.193.256  [B (java.base@11.0.9.1)
>>    2:         35.400       13.702.112  [I (java.base@11.0.9.1)
>>    3:        260.387        6.249.288  java.lang.String (java.base@11.0.9.1)
>>    4:        148.836        5.953.440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
>>    5:         17.641        5.222.344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
>>
>> Thanks in advance for any help,
>> Flavio
>>
>> [1]
>> --------------------------------------------------------------------------------------------------
>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>> has occurred. This can mean two things: either the job requires a larger
>> size of JVM metaspace to load classes or there is a class loading leak. In
>> the first case 'taskmanager.memory.jvm-metaspace.size' configuration option
>> should be increased. If the error persists (usually in cluster after
>> several job (re-)submissions) then there is probably a class loading leak
>> in user code or some of its dependencies which has to be investigated and
>> fixed. The task executor has to be shutdown...
>>         at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
>>         at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
>>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
>>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
>>         at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
>>         at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>>         at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
>>
>>
