The exclusions should not have any impact on that, because what defines
which classloader will load which class is not the presence of a
particular class in a specific jar, but the configuration of the
parent-first patterns [1].
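For reference, these are ordinary flink-conf.yaml options; a minimal
sketch (the default value is abbreviated here, see [1] for the full
list; the "additional" entry is a hypothetical example):

  classloader.resolve-order: child-first
  # prefixes loaded by the parent (TaskManager) classloader; everything
  # else goes through the per-job child-first user-code classloader
  classloader.parent-first-patterns.default: java.;scala.;org.apache.flink.;org.slf4j;...
  # append your own prefixes here instead of overwriting the default
  classloader.parent-first-patterns.additional: my.company.shared.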
Even if you don't use any Flink-internal imports, it might still be the
case that a class in one of the packages defined by the parent-first
patterns holds a reference to your user-code classes, which would cause
the leak. I'd recommend inspecting the heap dump after several restarts
of the application and looking for references to Class objects from the
root set.
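If it helps: a full heap dump (rather than just a histogram) can be
taken with the standard JDK tooling, e.g.

  jmap -dump:live,format=b,file=taskmanager.hprof <PID>

and then opened in Eclipse MAT or VisualVM. In MAT, the "Path to GC
Roots" view (with weak/soft references excluded) on one of the leaked
Class instances usually points straight at the offending static field.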
Jan
[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading
On 11/16/20 5:34 PM, Flavio Pompermaier wrote:
I've tried to remove all possible imports of classes not contained in
the fat jar, but I still face the same problem.
I've also tried to reduce as much as possible the excludes in the shade
section of the Maven plugin (I took the one at [1]), so now I exclude
only a few dependencies. Could it be that I should include org.slf4j:*
if I use a static import of it?
<artifactSet>
  <excludes>
    <exclude>com.google.code.findbugs:jsr305</exclude>
    <exclude>org.slf4j:*</exclude>
    <exclude>log4j:*</exclude>
  </excludes>
</artifactSet>
[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies
On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:
Yes, that could definitely cause this. You should probably avoid
using these Flink-internal shaded classes and ship your own
(non-shaded) versions instead.
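A minimal sketch of the swap, assuming Jackson is the dependency in
question (JsonUtil is a hypothetical helper class):

// pom.xml: depend on com.fasterxml.jackson.core:jackson-databind
// directly and let the shade plugin bundle it into the fat jar.

// avoid compiling against Flink's relocated copy:
// import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;

// use the vanilla class, which the child-first user-code classloader
// then loads from your own fat jar:
import com.fasterxml.jackson.databind.ObjectMapper;

class JsonUtil {
    static final ObjectMapper MAPPER = new ObjectMapper();
}

Since JsonUtil itself is loaded child-first, its static state lives and
dies with the job's classloader instead of pinning it.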
Best,
Jan
On 11/16/20 3:22 PM, Flavio Pompermaier wrote:
Thank you Jan for your valuable feedback.
Could it be that I should not import shaded-jackson classes in my
user code? For example,
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?
Best,
Flavio
On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:
Hi Flavio,
when I encountered a problem quite similar to the one you describe,
it was related to static storage located in a class that was loaded
"parent-first". In my case it was java.lang.ClassValue, but it might
be (and probably will be) different in your case. The problem is
that if user code registers something in some (static) storage
located in a class loaded by the parent (TaskManager) classloader,
then the associated classes will never be GC'd and Metaspace will
grow. A good starting point would be not to focus on the biggest
consumers of heap (in general), but to look at where the 15k
objects of type Class are referenced from. That might help you
figure this out. I'm not sure if there is anything that can be done
in general to prevent this type of leak. That would probably be a
question for the dev@ mailing list.
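For illustration only (all names below are hypothetical, not taken
from Flavio's job), the leak pattern described here boils down to
something like this:

// Loaded parent-first, so this class (and its static map) lives as
// long as the TaskManager JVM and survives every job submission:
class ParentFirstRegistry {
    private static final java.util.Map<String, Object> CACHE =
            new java.util.concurrent.ConcurrentHashMap<>();

    static void register(String key, Object value) {
        CACHE.put(key, value);
    }
}

// Loaded child-first: each job submission gets a fresh user-code
// classloader.
class MyMapFunction {
    static {
        // this entry keeps MyMapFunction.class reachable from the
        // parent classloader, which pins the whole user-code
        // classloader and every class it loaded -- one classloader's
        // worth of Metaspace leaks per submission
        ParentFirstRegistry.register("udf", MyMapFunction.class);
    }
}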
Best,
Jan
On 11/16/20 2:27 PM, Flavio Pompermaier wrote:
Hello everybody,
I was writing this email when a similar thread appeared on this
mailing list.
The difference is that the other problem seems to be related to
Flink 1.10 on YARN and does not output anything helpful for
debugging the cause of the problem.
Indeed, in my use case I use Flink 1.11.0 on a standalone session
cluster (the job is submitted to the cluster using the CLI client).
The problem arises when I submit the same job about 20 times (this
number unfortunately is not deterministic and can change a little
bit). The error reported by the Task Executor is related to the
ever-growing Metaspace; the error message is actually pretty
detailed [1].
I found the same issue in some previous threads on this mailing
list and I've tried to figure out the cause of the problem. The
trouble is that, looking at the allocated objects, I don't really
get an idea of the source of the problem, because the types of
objects consuming the memory are general-purpose ones (i.e. bytes,
integers and strings). These are my "top" memory consumers, looking
at the output of jmap -histo <PID>:
At run 0:
 num    #instances        #bytes  class name (module)
-------------------------------------------------------
   1:        46238      13224056  [B (java.base@11.0.9.1)
   2:         3736       6536672  [I (java.base@11.0.9.1)
   3:        38081        913944  java.lang.String (java.base@11.0.9.1)
   4:           26        852384  [Lakka.dispatch.forkjoin.ForkJoinTask;
   5:         7146        844984  java.lang.Class (java.base@11.0.9.1)

At run 1:
   1:       77.608    25.317.496  [B (java.base@11.0.9.1)
   2:        7.004     9.088.360  [I (java.base@11.0.9.1)
   3:       15.814     1.887.256  java.lang.Class (java.base@11.0.9.1)
   4:       67.381     1.617.144  java.lang.String (java.base@11.0.9.1)
   5:        3.906     1.422.960  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)

At run 6:
   1:       81.408    25.375.400  [B (java.base@11.0.9.1)
   2:       12.479     7.249.392  [I (java.base@11.0.9.1)
   3:       29.090     3.496.168  java.lang.Class (java.base@11.0.9.1)
   4:        4.347     2.813.416  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
   5:       71.584     1.718.016  java.lang.String (java.base@11.0.9.1)

At run 8:
   1:      985.979   127.193.256  [B (java.base@11.0.9.1)
   2:       35.400    13.702.112  [I (java.base@11.0.9.1)
   3:      260.387     6.249.288  java.lang.String (java.base@11.0.9.1)
   4:      148.836     5.953.440  java.util.HashMap$KeyIterator (java.base@11.0.9.1)
   5:       17.641     5.222.344  [Ljava.util.HashMap$Node; (java.base@11.0.9.1)
Thanks in advance for any help,
Flavio
[1]
--------------------------------------------------------------------------------------------------
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory
error has occurred. This can mean two things: either the job requires
a larger size of JVM metaspace to load classes or there is a class
loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size'
configuration option should be increased. If the error persists
(usually in cluster after several job (re-)submissions) then there is
probably a class loading leak in user code or some of its dependencies
which has to be investigated and fixed. The task executor has to be
shutdown...
    at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
    at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
    at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
    at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
    at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
    at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
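(For reference, the option named in the error is an ordinary
flink-conf.yaml entry, e.g. taskmanager.memory.jvm-metaspace.size: 512m;
raising it only buys time if the real cause is a classloader leak.)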