Re: How to debug Metaspace exception?

huweihua Tue, 19 Apr 2022 03:02:07 -0700

Hi, John

Sorry for the late reply. You can use MAT[1] to analyze the dump file. Check
whether too many classes are loaded.


[1] https://www.eclipse.org/mat/
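
As a lighter-weight check before opening the dump, the loaded-class count can also be read through the JDK's standard management beans. The sketch below is only an illustration: the class name MetaspaceProbe is made up, the pool name "Metaspace" is the HotSpot naming and may differ on other JVMs, and run standalone it reports on its own JVM, so to be useful it would have to run inside the TaskManager process or be adapted to a remote JMX connection.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink: reports class-loading counters
// and Metaspace usage for the JVM it runs in.
final class MetaspaceProbe {

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /**
     * Bytes currently used by the memory pool named "Metaspace"
     * (HotSpot naming, an assumption), or -1 if no such pool is found.
     */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** One-line summary that can be logged periodically and compared over time. */
    static String summary() {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        return String.format(
            "loaded=%d totalLoaded=%d unloaded=%d metaspaceUsedBytes=%d",
            cl.getLoadedClassCount(),
            cl.getTotalLoadedClassCount(),
            cl.getUnloadedClassCount(),
            metaspaceUsedBytes());
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

If "loaded" keeps climbing across job restarts while "unloaded" stays flat, that points toward class loaders not being released, which is the thing to look for in MAT.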

> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
> 
> Hi, can anyone help with this? I never looked at a dump file before.
> 
> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev....@gmail.com 
> <mailto:java.dev....@gmail.com>> wrote:
> Hi, so I have a dump file. What do I look for?
> 
> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev....@gmail.com 
> <mailto:java.dev....@gmail.com>> wrote:
> Ok, so if there's a leak and I manually stop the job and restart it from the
> UI multiple times, I won't see the issue because the classes are
> unloaded correctly?
> 
> 
> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua....@gmail.com 
> <mailto:huweihua....@gmail.com>> wrote:
> 
> The difference is that manually canceling the job stops the JobMaster, but
> automatic failover keeps the JobMaster running. From the TaskManager's
> point of view, though, it doesn't make much difference.
> 
> 
>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com
>> <mailto:java.dev....@gmail.com>> wrote:
>> 
>> Also, if I manually cancel and restart the same job over and over, is it the
>> same as if Flink was restarting a job due to failure?
>>
>> I.e., when I click "Cancel Job" on the UI, is the job completely unloaded vs
>> when the job scheduler restarts a job for whatever reason?
>>
>> Like this I'll stop and restart the job a few times, or maybe I can trick my
>> job to fail and have the scheduler restart it. Ok, let me think about this...
>> 
>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com 
>> <mailto:huweihua....@gmail.com>> wrote:
>>> So if I run the same jobs in my dev env will I still be able to see a
>>> similar dump?
>> I think running the same job in dev should reproduce it; maybe you can
>> give it a try.
>> 
>>>  If not I would have to wait for a low-volume time to do it on production.
>>> Also, if I recall, the dump is as big as the JVM memory, right? So if I have
>>> 10GB configured for the JVM the dump will be a 10GB file?
>> 
>> Yes, jmap will pause the JVM, and the length of the pause depends on how
>> much there is to dump. You can use "jmap -dump:live" to dump only the
>> reachable objects, which should keep the pause brief.
>> 
>> 
>> 
>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com
>>> <mailto:java.dev....@gmail.com>> wrote:
>>> 
>>> I have 3 task managers (see config below). There are a total of 10 jobs with
>>> 25 slots being used.
>>> The jobs are 100% ETL, i.e., they load JSON, transform it and push it to
>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>
>>> For jmap: I know that it will pause the task manager. So if I run the same
>>> jobs in my dev env will I still be able to see a similar dump? I assume
>>> so. If not I would have to wait for a low-volume time to do it on
>>> production. Also, if I recall, the dump is as big as the JVM memory, right? So
>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>> 
>>> 
>>> # Operating system has 16GB total.
>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>> 
>>> cluster.evenly-spread-out-slots: true
>>> 
>>> taskmanager.memory.flink.size: 10240m
>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>> taskmanager.numberOfTaskSlots: 16
>>> parallelism.default: 1
>>> 
>>> high-availability: zookeeper
>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>> high-availability.zookeeper.quorum: ...
>>> high-availability.zookeeper.path.root: /flink_1_14
>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>> 
>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>> 
>>> state.backend: rocksdb
>>> state.backend.incremental: true
>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>> 
>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com 
>>> <mailto:huweihua....@gmail.com>> wrote:
>>> Hi, John
>>> 
>>> Could you tell us your application scenario? Is it a Flink session cluster
>>> with a lot of jobs?
>>>
>>> Maybe you can try to dump the memory with jmap and use tools such as MAT to
>>> analyze whether there are abnormal classes and classloaders.
>>> 
>>> 
>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com
>>> > <mailto:java.dev....@gmail.com>> wrote:
>>> > 
>>> > Hi, running 1.14.4
>>> >
>>> > My task manager still fails with java.lang.OutOfMemoryError: Metaspace.
>>> > The metaspace out-of-memory error has occurred. This can mean two things:
>>> > either the job requires a larger size of JVM metaspace to load classes or
>>> > there is a class loading leak.
>>> >
>>> > I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size:
>>> > 2048m
>>> > 
>>> > But the task nodes still fail.
>>> > 
>>> > When looking at the UI metrics, the metaspace starts low. Now I see 85% 
>>> > usage. It seems to be a class loading leak at this point, how can we 
>>> > debug this issue?
>>> 
>> 
>
