Re: How to debug Metaspace exception?

John Smith Wed, 30 Mar 2022 13:01:53 -0700

Also, if I manually cancel and restart the same job over and over, is it the
same as if Flink were restarting a job due to failure?


I.e., when I click "Cancel Job" in the UI, is the job completely unloaded, vs.
when the job scheduler restarts a job for whatever reason?

Like this I'll stop and restart the job a few times, or maybe I can trick my
job into failing and have the scheduler restart it. OK, let me think about this...
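
To compare the two restart paths, one option is to record Metaspace usage and class counts after each cancel/restart cycle. The following is a minimal sketch, assuming a HotSpot JVM where the memory pool is named "Metaspace". The class name MetaspaceProbe is hypothetical, and the code reads the JVM it runs in, so for a TaskManager it would have to run in-process or be adapted to a remote JMX connection.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints Metaspace usage and class counts of the current JVM.
public class MetaspaceProbe {

    /** Returns the usage of the "Metaspace" pool, or null if this JVM exposes no such pool. */
    static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot names the pool exactly "Metaspace".
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        return null;
    }

    /** Builds a one-line snapshot that can be logged after every restart cycle. */
    static String snapshot() {
        MemoryUsage usage = metaspaceUsage();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        long used = usage == null ? -1 : usage.getUsed();
        long max = usage == null ? -1 : usage.getMax(); // -1 means no limit is defined
        return String.format(
                "metaspaceUsed=%d metaspaceMax=%d loadedClasses=%d unloadedClasses=%d",
                used, max, classes.getLoadedClassCount(), classes.getUnloadedClassCount());
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

If loadedClasses keeps climbing across restarts while unloadedClasses stays flat, that would be consistent with a class loading leak rather than a job that simply needs more Metaspace.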

On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com> wrote:

> So if I run the same jobs in my dev env will I still be able to see the
> similar dump?
>
> I think running the same job in dev should be reproducible, maybe you can
> have a try.
>
> If not I would have to wait for a low-volume time to do it on production.
> Also, if I recall, the dump is as big as the JVM memory, right? So if I have
> 10GB configured for the JVM, the dump will be a 10GB file?
>
> Yes, jmap will pause the JVM; the length of the pause depends on the size of
> the dump. You can use "jmap -dump:live" to dump only the reachable objects,
> which should keep the pause brief.
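
The same "live" option is also exposed programmatically through the JDK's HotSpotDiagnosticMXBean. The following is a minimal sketch, assuming a HotSpot JVM. The class name LiveHeapDump and the temp-directory location are hypothetical, and the code dumps the JVM it runs in, so it illustrates the flag rather than replacing jmap against a running TaskManager.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes a heap dump of the current JVM to a temp directory.
public class LiveHeapDump {

    /** Dumps the heap to a fresh temp file; live=true keeps only reachable objects. */
    static Path dumpToTemp(boolean live) {
        try {
            // The target file must not exist yet; recent JDKs also expect a ".hprof" suffix.
            Path target = Files.createTempDirectory("heapdump").resolve("live.hprof");
            HotSpotDiagnosticMXBean diag =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            diag.dumpHeap(target.toString(), live);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dump = dumpToTemp(true);
        System.out.println(dump.toFile().length() + " bytes written to " + dump);
    }
}
```

The printed size reflects the objects actually written to the dump, which can then be compared with the configured JVM memory before trying this on a 10GB TaskManager.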
>
>
>
> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>
> I have 3 task managers (see config below). There is a total of 10 jobs with
> 25 slots being used.
> The jobs are 100% ETL, i.e., they load JSON, transform it and push it to
> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>
> For jmap: I know that it will pause the task manager. So if I run the same
> jobs in my dev env, will I still be able to see a similar dump? I assume
> so. If not I would have to wait for a low-volume time to do it on
> production. Also, if I recall, the dump is as big as the JVM memory, right? So
> if I have 10GB configured for the JVM, the dump will be a 10GB file?
>
>
> # Operating system has 16GB total.
> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>
> cluster.evenly-spread-out-slots: true
>
> taskmanager.memory.flink.size: 10240m
> taskmanager.memory.jvm-metaspace.size: 2048m
> taskmanager.numberOfTaskSlots: 16
> parallelism.default: 1
>
> high-availability: zookeeper
> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
> high-availability.zookeeper.quorum: ...
> high-availability.zookeeper.path.root: /flink_1_14
> high-availability.cluster-id: /flink_1_14_cluster_0001
>
> web.upload.dir: /mnt/flink/uploads/flink_1_14
>
> state.backend: rocksdb
> state.backend.incremental: true
> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>
> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com> wrote:
>
>> Hi, John
>>
>> Could you tell us your application scenario? Is it a Flink session cluster
>> with a lot of jobs?
>>
>> Maybe you can try to dump the memory with jmap and use tools such as MAT
>> to analyze whether there are abnormal classes and classloaders
>>
>>
>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>> >
>> > Hi, I'm running 1.14.4.
>> >
>> > My task manager still fails with java.lang.OutOfMemoryError:
>> > Metaspace. The metaspace out-of-memory error has occurred. This can mean
>> > two things: either the job requires a larger size of JVM metaspace to load
>> > classes or there is a class loading leak.
>> >
>> > I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size:
>> > 2048m
>> >
>> > But the task nodes still fail.
>> >
>> > When looking at the UI metrics, the metaspace starts low. Now I see 85%
>> > usage. It seems to be a class loading leak at this point. How can we debug
>> > this issue?
>>
>>
>
