Re: How to debug Metaspace exception?

John Smith Tue, 26 Apr 2022 20:00:04 -0700

Hi Chesnay, as per the docs:
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/


You can either put the jars in the task manager lib folder or use
classloader.parent-first-patterns.additional
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

I prefer the latter: that way the dependency stays with the user-jar and
not on the task manager.
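
For anyone reading along, here is a rough sketch of what a parent-first
pattern does. The Flink docs describe a pattern as a simple prefix checked
against the fully qualified class name. The class and method names below
(ParentFirstCheck, isParentFirst) are made up for illustration and are not
Flink's own code, and the two driver class names are just sample inputs.
Note that, as Chesnay points out further down, a matching pattern only
decides which classloader is asked first; it does not by itself make a class
available from /lib.

```java
import java.util.List;

/**
 * Simplified illustration of prefix-based parent-first matching.
 * Hypothetical helper, NOT Flink's actual classloader implementation.
 */
final class ParentFirstCheck {

    /** Returns true if className starts with any of the configured prefixes. */
    static boolean isParentFirst(String className, List<String> parentFirstPrefixes) {
        for (String prefix : parentFirstPrefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed value of classloader.parent-first-patterns.additional
        List<String> patterns = List.of("org.apache.ignite.");

        // Sample class names, chosen only to show one match and one miss
        String[] candidates = {
            "org.apache.ignite.IgniteJdbcThinDriver",
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        };
        for (String name : candidates) {
            String order = isParentFirst(name, patterns) ? "parent-first" : "child-first";
            System.out.println(name + " -> " + order);
        }
    }
}
```

With only "org.apache.ignite." configured, the Microsoft driver would still
be resolved child-first, so it would need its own pattern (or to live only
in /lib).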

On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com> wrote:

> Ok, so I should put the Apache Ignite and my Microsoft drivers in the lib
> folders of my task managers?
>
> And then only include them as compile-time dependencies in my job jar?
>
>
> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org>
> wrote:
>
>> JDBC drivers are, unfortunately, well known for leaking classloaders.
>>
>> You have correctly identified your alternatives.
>>
>> You must put the JDBC driver into /lib instead. Setting only the
>> parent-first pattern shouldn't affect anything.
>> That is only relevant if something is in both /lib and the user-jar,
>> telling Flink to prioritize what is in /lib.
>>
>>
>>
>> On 26/04/2022 15:35, John Smith wrote:
>>
>> So I put classloader.parent-first-patterns.additional:
>> "org.apache.ignite." in the task config and so far I don't think I'm
>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>
>> Or it's too early to tell.
>>
>> Though now, the task managers are shutting down due to some
>> other failures.
>>
>> So maybe, because tasks were failing and reloading often, the task manager
>> was running out of Metaspace. But now maybe it's just shutting down cleanly.
>>
>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com>
>> wrote:
>>
>>> Or can I put in the config to treat org.apache.ignite. classes as
>>> parent-first?
>>>
>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>
>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>> "Exclude all phantom/weak/soft references"
>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>
>>>> So I'm guessing this applies to anything JDBC-based. Should I copy those
>>>> drivers into the task manager lib folder and make them compile-only
>>>> dependencies in my jobs?
>>>>
>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>> yaros...@goldsky.io> wrote:
>>>>
>>>>> Also
>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>
>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ches...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>> steps I took to debug another leak):
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>
>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>
>>>>>> Hi, John
>>>>>>
>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>> file. Check whether there are too many loaded classes.
>>>>>>
>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>
>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>
>>>>>> Hi, can anyone help with this? I have never looked at a dump file before.
>>>>>>
>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>
>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, so if there's a leak and I manually stop the job and restart it
>>>>>>>> from the UI multiple times, I won't see the issue because the
>>>>>>>> classes are unloaded correctly?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. But
>>>>>>>>> from the TaskManager's side, it doesn't make much difference.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Also, if I manually cancel and restart the same job over and over,
>>>>>>>>> is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>
>>>>>>>>> I.e., when I click "Cancel Job" on the UI, is the job completely
>>>>>>>>> unloaded vs. when the job scheduler restarts a job for whatever
>>>>>>>>> reason?
>>>>>>>>>
>>>>>>>>> Like this, I'll stop and restart the job a few times, or maybe I can
>>>>>>>>> trick my job into failing and have the scheduler restart it. Ok, let
>>>>>>>>> me think about this...
>>>>>>>>>
>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>>> see the similar dump?
>>>>>>>>>>
>>>>>>>>>> I think running the same job in dev should reproduce it; maybe
>>>>>>>>>> you can give it a try.
>>>>>>>>>>
>>>>>>>>>> If not I would have to wait for a low-volume time to do it on
>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>>>>> right? So if I have 10GB configured for the JVM the dump will be a
>>>>>>>>>> 10GB file?
>>>>>>>>>>
>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>>>>> reachable objects, which will take only a brief pause.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10
>>>>>>>>>> jobs with 25 slots being used.
>>>>>>>>>> The jobs are 100% ETL, i.e., they load JSON, transform it and push
>>>>>>>>>> it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite
>>>>>>>>>> cluster.
>>>>>>>>>>
>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I run
>>>>>>>>>> the same jobs in my dev env will I still be able to see a similar
>>>>>>>>>> dump? I assume so. If not I would have to wait for a low-volume time
>>>>>>>>>> to do it on production. Also, if I recall, the dump is as big as the
>>>>>>>>>> JVM memory, right? So if I have 10GB configured for the JVM the dump
>>>>>>>>>> will be a 10GB file?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>
>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>
>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>
>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>
>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>
>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, John
>>>>>>>>>>>
>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>
>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>> classloaders.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi, running 1.14.4
>>>>>>>>>>> >
>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can
>>>>>>>>>>> > mean two things: either the job requires a larger size of JVM
>>>>>>>>>>> > metaspace to load classes or there is a class loading leak.
>>>>>>>>>>> >
>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>> >
>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>> >
>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now
>>>>>>>>>>> > I see 85% usage. It seems to be a class loading leak at this point;
>>>>>>>>>>> > how can we debug this issue?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>>
>>
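
Regarding the original question at the bottom of this thread (Metaspace
starting low and climbing to 85%): the same number can be read from inside
any JVM through the standard java.lang.management API. This is only a
sketch. The class name MetaspaceProbe and the helper usedFraction are
hypothetical, and the pool name "Metaspace" is what HotSpot-based JVMs
report; other JVMs may name it differently or not expose it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

/** Hypothetical helper that reports Metaspace usage of the current JVM. */
final class MetaspaceProbe {

    /** Finds the memory pool named "Metaspace" (HotSpot naming); empty if absent. */
    static Optional<MemoryPoolMXBean> findMetaspacePool() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(pool -> "Metaspace".equals(pool.getName()))
                .findFirst();
    }

    /** Returns used/max as a fraction, or -1.0 when no maximum is defined. */
    static double usedFraction(long usedBytes, long maxBytes) {
        return maxBytes > 0 ? (double) usedBytes / maxBytes : -1.0;
    }

    public static void main(String[] args) {
        Optional<MemoryPoolMXBean> pool = findMetaspacePool();
        if (!pool.isPresent()) {
            System.out.println("No pool named Metaspace on this JVM");
            return;
        }
        MemoryUsage usage = pool.get().getUsage();
        // getMax() is -1 when the JVM runs without a Metaspace limit
        double fraction = usedFraction(usage.getUsed(), usage.getMax());
        System.out.println("Metaspace used bytes: " + usage.getUsed());
        System.out.println("Metaspace max bytes:  " + usage.getMax());
        if (fraction >= 0) {
            System.out.printf("Metaspace usage: %.1f%%%n", fraction * 100);
        }
    }
}
```

A value that keeps growing across job restarts, rather than levelling off,
is the pattern that pointed to a classloader leak in this thread.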
