Re: How to debug Metaspace exception?

2022-05-02 Thread John Smith
Ok, I don't think I'm running user code on the job manager. Basically, I'm
running a standalone cluster:

3 zookeepers
3 job managers
3 task managers.

I submit my jobs via the UI.

But just in case, I'll copy the config over to the job managers.



On Mon, May 2, 2022 at 11:00 AM Chesnay Schepler  wrote:

> There are cases where user-code is run on the JobManager.
> I'm not sure whether that applies to the JDBC sources, though.
>
> On 02/05/2022 15:45, John Smith wrote:
>
> Why do the JDBC jars need to be on the job manager node though?
>
> On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler 
> wrote:
>
>> yes.
>> But if you can ensure that the driver isn't bundled by any user-jar you
>> can also skip the pattern configuration step.
>>
>> The pattern looks correct formatting-wise; you could try whether
>> com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>>
>> On 02/05/2022 14:41, John Smith wrote:
>>
>> Oh, so I should copy the jars to the lib folder and
>> set classloader.parent-first-patterns.additional:
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task
>> managers and job managers?
>>
>> Also is my pattern correct?
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>>
>> Just to be sure I'm running a standalone cluster using zookeeper. So I
>> have 3 zookeepers, 3 job managers and 3 task managers.
>>
>>
>> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler 
>> wrote:
>>
>>> And you should make sure that it is set for both processes!
>>>
>>> On 02/05/2022 08:43, Chesnay Schepler wrote:
>>>
>>> The setting itself isn't taskmanager specific; it applies to both the
>>> job- and taskmanager process.
>>>
>>> On 02/05/2022 05:29, John Smith wrote:
>>>
>>> Also just to be sure this is a Task Manager setting right?
>>>
>>> On Thu, Apr 28, 2022 at 11:13 AM John Smith 
>>> wrote:
>>>
 I assume you will take action on your side to track and fix the doc? :)

 On Thu, Apr 28, 2022 at 11:12 AM John Smith 
 wrote:

> Ok so to summarize...
>
> - Build my job jar and have the JDBC driver as a compile only
> dependency and copy the JDBC driver to flink lib folder.
>
> Or
>
> - Build my job jar and include JDBC driver in the shadow, plus copy
> the JDBC driver in the flink lib folder, plus  make an entry in config for
> classloader.parent-first-patterns-additional
> 
>
>
> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler 
> wrote:
>
>> I think what I meant was "either add it to /lib, or [if it is already
>> in /lib but also bundled in the jar] add it to the parent-first 
>> patterns."
>>
>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>
>> Pretty sure, even though I seemingly documented it incorrectly :)
>>
>> On 28/04/2022 15:49, John Smith wrote:
>>
>> You sure?
>>
>>    - *JDBC*: JDBC drivers leak references outside the user code
>>      classloader. To ensure that these classes are only loaded once you
>>      should either add the driver jars to Flink’s lib/ folder, or add the
>>      driver classes to the list of parent-first loaded class via
>>      classloader.parent-first-patterns-additional.
>>
>>It says either or
>>
>>
>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler 
>> wrote:
>>
>>> You're misinterpreting the docs.
>>>
>>> The parent/child-first classloading controls where Flink looks for a
>>> class *first*, specifically whether we first load from /lib or the
>>> user-jar.
>>> It does not allow you to load something from the user-jar in the
>>> parent classloader. That's just not how it works.
>>>
>>> It must be in /lib.
>>>
>>> On 27/04/2022 04:59, John Smith wrote:
>>>
>>> Hi Chesnay as per the docs...
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>
>>> You can either put the jars in task manager lib folder or use
>>> classloader.parent-first-patterns-additional
>>> 
>>>
>>> I prefer the latter like this: the dependency stays with the
>>> user-jar and not on the task manager.
>>>
>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith 
>>> wrote:
>>>
 Ok so I should put the Apache ignite and my Microsoft drivers in
 the lib folders of my task managers?

 And then in my job jar only include them as compile time
 dependencies?


 On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <

Re: How to debug Metaspace exception?

2022-05-02 Thread Chesnay Schepler

There are cases where user-code is run on the JobManager.
I'm not sure whether that applies to the JDBC sources, though.
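
To see where a driver actually ends up on a given process, one option is to
log the defining classloader of each registered JDBC driver from inside the
job. This is a minimal sketch rather than anything Flink ships: the class
JdbcDriverLoaderReport and its methods are made-up names, and it only uses
java.sql.DriverManager from the JDK. Note that DriverManager.getDrivers()
only returns drivers the calling class has access to, so the output depends
on where the code runs.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper (not part of Flink): reports which
// classloader defined each JDBC driver visible to the calling code.
public class JdbcDriverLoaderReport {

    // Returns the class name of the loader that defined the given class,
    // or "bootstrap" for JDK core classes (their loader is null).
    public static String describeLoader(Class<?> clazz) {
        ClassLoader loader = clazz.getClassLoader();
        return loader == null ? "bootstrap" : loader.getClass().getName();
    }

    // One line per registered driver: "<driver class> <- <loader class>".
    // getDrivers() only lists drivers the caller has access to.
    public static List<String> report() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            Class<?> driverClass = driver.getClass();
            lines.add(driverClass.getName() + " <- " + describeLoader(driverClass));
        }
        return lines;
    }

    public static void main(String[] args) {
        // May print nothing if no JDBC driver is on the classpath.
        report().forEach(System.out::println);
    }
}
```

If the loader name printed for the SQL Server or Ignite driver contains
ChildFirstClassLoader (the class that showed up in the heap dump), the
driver presumably came from the user-jar rather than /lib.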

On 02/05/2022 15:45, John Smith wrote:

Why do the JDBC jars need to be on the job manager node though?

On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler  
wrote:


yes.
But if you can ensure that the driver isn't bundled by any
user-jar you can also skip the pattern configuration step.

The pattern looks correct formatting-wise; you could try whether
com.microsoft.sqlserver.jdbc. is enough to solve the issue.

On 02/05/2022 14:41, John Smith wrote:

Oh, so I should copy the jars to the lib folder and
set classloader.parent-first-patterns.additional:
"org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the
task managers and job managers?

Also is my pattern correct?
"org.apache.ignite.;com.microsoft.sqlserver.jdbc."

Just to be sure I'm running a standalone cluster using zookeeper.
So I have 3 zookeepers, 3 job managers and 3 task managers.


On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler
 wrote:

And you should make sure that it is set for both processes!

On 02/05/2022 08:43, Chesnay Schepler wrote:

The setting itself isn't taskmanager specific; it applies to
both the job- and taskmanager process.

On 02/05/2022 05:29, John Smith wrote:

Also just to be sure this is a Task Manager setting right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith
 wrote:

I assume you will take action on your side to track and
fix the doc? :)

On Thu, Apr 28, 2022 at 11:12 AM John Smith
 wrote:

Ok so to summarize...

- Build my job jar and have the JDBC driver as a
compile only dependency and copy the JDBC driver to
flink lib folder.

Or

- Build my job jar and include JDBC driver in the
shadow, plus copy the JDBC driver in the flink lib
folder, plus  make an entry in config for
|classloader.parent-first-patterns-additional|




On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
 wrote:

I think what I meant was "either add it to
/lib, or [if it is already in /lib but also
bundled in the jar] add it to the parent-first
patterns."

On 28/04/2022 15:56, Chesnay Schepler wrote:

Pretty sure, even though I seemingly
documented it incorrectly :)

On 28/04/2022 15:49, John Smith wrote:

You sure?

 * /JDBC/: JDBC drivers leak references
   outside the user code classloader. To
   ensure that these classes are only loaded
   once you should either add the driver
   jars to Flink’s |lib/| folder, or add the
   driver classes to the list of
   parent-first loaded class via
   |classloader.parent-first-patterns-additional|.

It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay
Schepler  wrote:

You're misinterpreting the docs.

The parent/child-first classloading
controls where Flink looks for a class
/first/, specifically whether we first
load from /lib or the user-jar.
It does not allow you to load something
from the user-jar in the parent
classloader. That's just not how it works.

It must be in /lib.

On 27/04/2022 04:59, John Smith wrote:

Hi Chesnay as per the docs...

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in task
manager lib folder or use
|classloader.parent-first-patterns-additional|



I prefer the latter like this: the
dependency stays with the user-jar and
not on the task manager.


Re: How to debug Metaspace exception?

2022-05-02 Thread John Smith
Why do the JDBC jars need to be on the job manager node though?

On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler  wrote:

> yes.
> But if you can ensure that the driver isn't bundled by any user-jar you
> can also skip the pattern configuration step.
>
> The pattern looks correct formatting-wise; you could try whether
> com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>
> On 02/05/2022 14:41, John Smith wrote:
>
> Oh, so I should copy the jars to the lib folder and
> set classloader.parent-first-patterns.additional:
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task
> managers and job managers?
>
> Also is my pattern correct?
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>
> Just to be sure I'm running a standalone cluster using zookeeper. So I
> have 3 zookeepers, 3 job managers and 3 task managers.
>
>
> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler 
> wrote:
>
>> And you should make sure that it is set for both processes!
>>
>> On 02/05/2022 08:43, Chesnay Schepler wrote:
>>
>> The setting itself isn't taskmanager specific; it applies to both the
>> job- and taskmanager process.
>>
>> On 02/05/2022 05:29, John Smith wrote:
>>
>> Also just to be sure this is a Task Manager setting right?
>>
>> On Thu, Apr 28, 2022 at 11:13 AM John Smith 
>> wrote:
>>
>>> I assume you will take action on your side to track and fix the doc? :)
>>>
>>> On Thu, Apr 28, 2022 at 11:12 AM John Smith 
>>> wrote:
>>>
 Ok so to summarize...

 - Build my job jar and have the JDBC driver as a compile only
 dependency and copy the JDBC driver to flink lib folder.

 Or

 - Build my job jar and include JDBC driver in the shadow, plus copy the
 JDBC driver in the flink lib folder, plus  make an entry in config for
 classloader.parent-first-patterns-additional
 


 On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler 
 wrote:

> I think what I meant was "either add it to /lib, or [if it is already
> in /lib but also bundled in the jar] add it to the parent-first patterns."
>
> On 28/04/2022 15:56, Chesnay Schepler wrote:
>
> Pretty sure, even though I seemingly documented it incorrectly :)
>
> On 28/04/2022 15:49, John Smith wrote:
>
> You sure?
>
>    - *JDBC*: JDBC drivers leak references outside the user code
>      classloader. To ensure that these classes are only loaded once you
>      should either add the driver jars to Flink’s lib/ folder, or add the
>      driver classes to the list of parent-first loaded class via
>      classloader.parent-first-patterns-additional.
>
>It says either or
>
>
> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler 
> wrote:
>
>> You're misinterpreting the docs.
>>
>> The parent/child-first classloading controls where Flink looks for a
>> class *first*, specifically whether we first load from /lib or the
>> user-jar.
>> It does not allow you to load something from the user-jar in the
>> parent classloader. That's just not how it works.
>>
>> It must be in /lib.
>>
>> On 27/04/2022 04:59, John Smith wrote:
>>
>> Hi Chesnay as per the docs...
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>
>> You can either put the jars in task manager lib folder or use
>> classloader.parent-first-patterns-additional
>> 
>>
>> I prefer the latter like this: the dependency stays with the user-jar
>> and not on the task manager.
>>
>> On Tue, Apr 26, 2022 at 9:52 PM John Smith 
>> wrote:
>>
>>> Ok so I should put the Apache ignite and my Microsoft drivers in the
>>> lib folders of my task managers?
>>>
>>> And then in my job jar only include them as compile time
>>> dependencies?
>>>
>>>
>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <
>>> ches...@apache.org> wrote:
>>>
 JDBC drivers are well-known for leaking classloaders unfortunately.

 You have correctly identified your alternatives.

 You must put the jdbc driver into /lib instead. Setting only the
 parent-first pattern shouldn't affect anything.
 That is only relevant if something is in both /lib and the
 user-jar, telling Flink to prioritize what is in lib.



 On 26/04/2022 15:35, John Smith wrote:

 So I put classloader.parent-first-patterns.additional:

Re: How to debug Metaspace exception?

2022-05-02 Thread Chesnay Schepler

yes.
But if you can ensure that the driver isn't bundled by any user-jar you 
can also skip the pattern configuration step.


The pattern looks correct formatting-wise; you could try whether 
com.microsoft.sqlserver.jdbc. is enough to solve the issue.
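
For what it's worth, my understanding is that each entry in
classloader.parent-first-patterns.additional is a plain prefix compared
against the fully qualified class name, and that entries are separated by
semicolons. The sketch below only imitates that matching so a pattern string
can be sanity-checked offline; ParentFirstPatternCheck and its methods are
hypothetical names and this is not Flink's actual code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not Flink code): imitates prefix matching of
// parent-first patterns against fully qualified class names.
public class ParentFirstPatternCheck {

    // Splits a semicolon-separated config value into non-empty prefixes.
    public static List<String> parsePatterns(String configValue) {
        return Arrays.stream(configValue.split(";"))
                .map(String::trim)
                .filter(pattern -> !pattern.isEmpty())
                .collect(Collectors.toList());
    }

    // True if the class name starts with any of the given prefixes.
    public static boolean isParentFirst(String className, List<String> patterns) {
        for (String pattern : patterns) {
            if (className.startsWith(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> patterns =
                parsePatterns("org.apache.ignite.;com.microsoft.sqlserver.jdbc.");
        // prints true
        System.out.println(
                isParentFirst("com.microsoft.sqlserver.jdbc.SQLServerDriver", patterns));
        // prints true
        System.out.println(
                isParentFirst("org.apache.ignite.IgniteJdbcThinDriver", patterns));
        // prints false (com.example.MyJob is a made-up user class)
        System.out.println(isParentFirst("com.example.MyJob", patterns));
    }
}
```

The trailing dot in each pattern keeps the prefix from also matching
unrelated packages that merely start with the same characters.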


On 02/05/2022 14:41, John Smith wrote:
Oh, so I should copy the jars to the lib folder and 
set classloader.parent-first-patterns.additional: 
"org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task 
managers and job managers?


Also is my pattern correct? 
"org.apache.ignite.;com.microsoft.sqlserver.jdbc."


Just to be sure I'm running a standalone cluster using zookeeper. So I 
have 3 zookeepers, 3 job managers and 3 task managers.



On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler  
wrote:


And you should make sure that it is set for both processes!

On 02/05/2022 08:43, Chesnay Schepler wrote:

The setting itself isn't taskmanager specific; it applies to both
the job- and taskmanager process.

On 02/05/2022 05:29, John Smith wrote:

Also just to be sure this is a Task Manager setting right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith
 wrote:

I assume you will take action on your side to track and fix
the doc? :)

On Thu, Apr 28, 2022 at 11:12 AM John Smith
 wrote:

Ok so to summarize...

- Build my job jar and have the JDBC driver as a compile
only dependency and copy the JDBC driver to flink lib
folder.

Or

- Build my job jar and include JDBC driver in the
shadow, plus copy the JDBC driver in the flink lib
folder, plus  make an entry in config for
|classloader.parent-first-patterns-additional|




On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
 wrote:

I think what I meant was "either add it to /lib, or
[if it is already in /lib but also bundled in the
jar] add it to the parent-first patterns."

On 28/04/2022 15:56, Chesnay Schepler wrote:

Pretty sure, even though I seemingly documented it
incorrectly :)

On 28/04/2022 15:49, John Smith wrote:

You sure?

 * /JDBC/: JDBC drivers leak references outside
   the user code classloader. To ensure that
   these classes are only loaded once you should
   either add the driver jars to Flink’s
   |lib/| folder, or add the driver classes to
   the list of parent-first loaded class via
   |classloader.parent-first-patterns-additional|.

It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
 wrote:

You're misinterpreting the docs.

The parent/child-first classloading controls
where Flink looks for a class /first/,
specifically whether we first load from /lib
or the user-jar.
It does not allow you to load something from
the user-jar in the parent classloader. That's
just not how it works.

It must be in /lib.

On 27/04/2022 04:59, John Smith wrote:

Hi Chesnay as per the docs...

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in task manager
lib folder or use
|classloader.parent-first-patterns-additional|



I prefer the latter like this: the
dependency stays with the user-jar and not on
the task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith
 wrote:

Ok so I should put the Apache ignite and
my Microsoft drivers in the lib folders
of my task managers?

And then in my job jar only include them
as compile time dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay
Schepler  wrote:

JDBC drivers are well-known for
leaking classloaders unfortunately.


Re: How to debug Metaspace exception?

2022-05-02 Thread John Smith
Oh, so I should copy the jars to the lib folder and
set classloader.parent-first-patterns.additional:
"org.apache.ignite.;com.microsoft.sqlserver.jdbc." on both the task
managers and job managers?

Also is my pattern correct?
"org.apache.ignite.;com.microsoft.sqlserver.jdbc."

Just to be sure, I'm running a standalone cluster using ZooKeeper. So I have
3 zookeepers, 3 job managers and 3 task managers.


On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler  wrote:

> And you should make sure that it is set for both processes!
>
> On 02/05/2022 08:43, Chesnay Schepler wrote:
>
> The setting itself isn't taskmanager specific; it applies to both the job-
> and taskmanager process.
>
> On 02/05/2022 05:29, John Smith wrote:
>
> Also just to be sure this is a Task Manager setting right?
>
> On Thu, Apr 28, 2022 at 11:13 AM John Smith 
> wrote:
>
>> I assume you will take action on your side to track and fix the doc? :)
>>
>> On Thu, Apr 28, 2022 at 11:12 AM John Smith 
>> wrote:
>>
>>> Ok so to summarize...
>>>
>>> - Build my job jar and have the JDBC driver as a compile only
>>> dependency and copy the JDBC driver to flink lib folder.
>>>
>>> Or
>>>
>>> - Build my job jar and include JDBC driver in the shadow, plus copy the
>>> JDBC driver in the flink lib folder, plus  make an entry in config for
>>> classloader.parent-first-patterns-additional
>>> 
>>>
>>>
>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler 
>>> wrote:
>>>
 I think what I meant was "either add it to /lib, or [if it is already
 in /lib but also bundled in the jar] add it to the parent-first patterns."

 On 28/04/2022 15:56, Chesnay Schepler wrote:

 Pretty sure, even though I seemingly documented it incorrectly :)

 On 28/04/2022 15:49, John Smith wrote:

 You sure?

    - *JDBC*: JDBC drivers leak references outside the user code
      classloader. To ensure that these classes are only loaded once you
      should either add the driver jars to Flink’s lib/ folder, or add the
      driver classes to the list of parent-first loaded class via
      classloader.parent-first-patterns-additional.

It says either or


 On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler 
 wrote:

> You're misinterpreting the docs.
>
> The parent/child-first classloading controls where Flink looks for a
> class *first*, specifically whether we first load from /lib or the
> user-jar.
> It does not allow you to load something from the user-jar in the
> parent classloader. That's just not how it works.
>
> It must be in /lib.
>
> On 27/04/2022 04:59, John Smith wrote:
>
> Hi Chesnay as per the docs...
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>
> You can either put the jars in task manager lib folder or use
> classloader.parent-first-patterns-additional
> 
>
> I prefer the latter like this: the dependency stays with the user-jar
> and not on the task manager.
>
> On Tue, Apr 26, 2022 at 9:52 PM John Smith 
> wrote:
>
>> Ok so I should put the Apache ignite and my Microsoft drivers in the
>> lib folders of my task managers?
>>
>> And then in my job jar only include them as compile time
>> dependencies?
>>
>>
>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler 
>> wrote:
>>
>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>
>>> You have correctly identified your alternatives.
>>>
>>> You must put the jdbc driver into /lib instead. Setting only the
>>> parent-first pattern shouldn't affect anything.
>>> That is only relevant if something is in both /lib and the
>>> user-jar, telling Flink to prioritize what is in lib.
>>>
>>>
>>>
>>> On 26/04/2022 15:35, John Smith wrote:
>>>
>>> So I put classloader.parent-first-patterns.additional:
>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>
>>> Or it's too early to tell.
>>>
>>> Though now, the task managers are shutting down due to some
>>> other failures.
>>>
>>> So maybe because tasks were failing and reloading often the task
>>> manager was running out of Metaspace. But now maybe it's just
>>> cleanly shutting down.
>>>
>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith 
>>> wrote:
>>>
 Or I can put in the config 

Re: How to debug Metaspace exception?

2022-05-02 Thread Chesnay Schepler

And you should make sure that it is set for both processes!

On 02/05/2022 08:43, Chesnay Schepler wrote:
The setting itself isn't taskmanager specific; it applies to both the 
job- and taskmanager process.


On 02/05/2022 05:29, John Smith wrote:

Also just to be sure this is a Task Manager setting right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith  
wrote:


I assume you will take action on your side to track and fix the
doc? :)

On Thu, Apr 28, 2022 at 11:12 AM John Smith
 wrote:

Ok so to summarize...

- Build my job jar and have the JDBC driver as a compile only
dependency and copy the JDBC driver to flink lib folder.

Or

- Build my job jar and include JDBC driver in the shadow,
plus copy the JDBC driver in the flink lib folder, plus  make
an entry in config for
|classloader.parent-first-patterns-additional|




On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
 wrote:

I think what I meant was "either add it to /lib, or [if
it is already in /lib but also bundled in the jar] add it
to the parent-first patterns."

On 28/04/2022 15:56, Chesnay Schepler wrote:

Pretty sure, even though I seemingly documented it
incorrectly :)

On 28/04/2022 15:49, John Smith wrote:

You sure?

 * /JDBC/: JDBC drivers leak references outside the
   user code classloader. To ensure that these classes
   are only loaded once you should either add the
   driver jars to Flink’s |lib/| folder, or add the
   driver classes to the list of parent-first loaded
   class via
   |classloader.parent-first-patterns-additional|.

It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
 wrote:

You're misinterpreting the docs.

The parent/child-first classloading controls where
Flink looks for a class /first/, specifically
whether we first load from /lib or the user-jar.
It does not allow you to load something from the
user-jar in the parent classloader. That's just not
how it works.

It must be in /lib.

On 27/04/2022 04:59, John Smith wrote:

Hi Chesnay as per the docs...

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in task manager lib
folder or use
|classloader.parent-first-patterns-additional|



I prefer the latter like this: the
dependency stays with the user-jar and not on the
task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith
 wrote:

Ok so I should put the Apache ignite and my
Microsoft drivers in the lib folders of my
task managers?

And then in my job jar only include them as
compile time dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay
Schepler  wrote:

JDBC drivers are well-known for leaking
classloaders unfortunately.

You have correctly identified your
alternatives.

You must put the jdbc driver into /lib
instead. Setting only the parent-first
pattern shouldn't affect anything.
That is only relevant if something is in
both /lib and the user-jar, telling
Flink to prioritize what is in lib.



On 26/04/2022 15:35, John Smith wrote:

So I
put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config
and so far I don't think I'm getting
"java.lang.OutOfMemoryError: Metaspace"
any more.

Or it's too early to tell.

Though now, the task managers are
shutting down due to some other failures.

So maybe because tasks were 

Re: How to debug Metaspace exception?

2022-05-02 Thread Chesnay Schepler
The setting itself isn't taskmanager specific; it applies to both the 
job- and taskmanager process.


On 02/05/2022 05:29, John Smith wrote:

Also just to be sure this is a Task Manager setting right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith  
wrote:


I assume you will take action on your side to track and fix the
doc? :)

On Thu, Apr 28, 2022 at 11:12 AM John Smith
 wrote:

Ok so to summarize...

- Build my job jar and have the JDBC driver as a compile only
dependency and copy the JDBC driver to flink lib folder.

Or

- Build my job jar and include JDBC driver in the shadow, plus
copy the JDBC driver in the flink lib folder, plus  make an
entry in config for
|classloader.parent-first-patterns-additional|




On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
 wrote:

I think what I meant was "either add it to /lib, or [if it
is already in /lib but also bundled in the jar] add it to
the parent-first patterns."

On 28/04/2022 15:56, Chesnay Schepler wrote:

Pretty sure, even though I seemingly documented it
incorrectly :)

On 28/04/2022 15:49, John Smith wrote:

You sure?

 * /JDBC/: JDBC drivers leak references outside the
   user code classloader. To ensure that these classes
   are only loaded once you should either add the
   driver jars to Flink’s |lib/| folder, or add the
   driver classes to the list of parent-first loaded
   class via
   |classloader.parent-first-patterns-additional|.

It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
 wrote:

You're misinterpreting the docs.

The parent/child-first classloading controls where
Flink looks for a class /first/, specifically
whether we first load from /lib or the user-jar.
It does not allow you to load something from the
user-jar in the parent classloader. That's just not
how it works.

It must be in /lib.

On 27/04/2022 04:59, John Smith wrote:

Hi Chesnay as per the docs...

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in task manager lib
folder or use
|classloader.parent-first-patterns-additional|



I prefer the latter like this: the dependency stays
with the user-jar and not on the task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith
 wrote:

Ok so I should put the Apache ignite and my
Microsoft drivers in the lib folders of my task
managers?

And then in my job jar only include them as
compile time dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay
Schepler  wrote:

JDBC drivers are well-known for leaking
classloaders unfortunately.

You have correctly identified your
alternatives.

You must put the jdbc driver into /lib
instead. Setting only the parent-first
pattern shouldn't affect anything.
That is only relevant if something is in
both /lib and the user-jar, telling
Flink to prioritize what is in lib.



On 26/04/2022 15:35, John Smith wrote:

So I
put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config
and so far I don't think I'm getting
"java.lang.OutOfMemoryError: Metaspace"
any more.

Or it's too early to tell.

Though now, the task managers are shutting
down due to some other failures.

So maybe because tasks were failing and
reloading often the task manager was
running out of Metaspace. But

Re: How to debug Metaspace exception?

2022-05-01 Thread John Smith
Also, just to be sure, this is a Task Manager setting, right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith  wrote:

> I assume you will take action on your side to track and fix the doc? :)
>
> On Thu, Apr 28, 2022 at 11:12 AM John Smith 
> wrote:
>
>> Ok so to summarize...
>>
>> - Build my job jar and have the JDBC driver as a compile only
>> dependency and copy the JDBC driver to flink lib folder.
>>
>> Or
>>
>> - Build my job jar and include JDBC driver in the shadow, plus copy the
>> JDBC driver in the flink lib folder, plus  make an entry in config for
>> classloader.parent-first-patterns-additional
>> 
>>
>>
>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler 
>> wrote:
>>
>>> I think what I meant was "either add it to /lib, or [if it is already in
>>> /lib but also bundled in the jar] add it to the parent-first patterns."
>>>
>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>
>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>
>>> On 28/04/2022 15:49, John Smith wrote:
>>>
>>> You sure?
>>>
>>>    - *JDBC*: JDBC drivers leak references outside the user code
>>>      classloader. To ensure that these classes are only loaded once you should
>>>      either add the driver jars to Flink’s lib/ folder, or add the driver
>>>      classes to the list of parent-first loaded class via
>>>      classloader.parent-first-patterns-additional.
>>>
>>>It says either or
>>>
>>>
>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler 
>>> wrote:
>>>
 You're misinterpreting the docs.

 The parent/child-first classloading controls where Flink looks for a
 class *first*, specifically whether we first load from /lib or the
 user-jar.
 It does not allow you to load something from the user-jar in the parent
 classloader. That's just not how it works.

 It must be in /lib.

 On 27/04/2022 04:59, John Smith wrote:

 Hi Chesnay as per the docs...
 https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

 You can either put the jars in task manager lib folder or use
 classloader.parent-first-patterns-additional
 

 I prefer the latter like this: the dependency stays with the user-jar
 and not on the task manager.

 On Tue, Apr 26, 2022 at 9:52 PM John Smith 
 wrote:

> Ok so I should put the Apache ignite and my Microsoft drivers in the
> lib folders of my task managers?
>
> And then in my job jar only include them as compile time dependencies?
>
>
> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler 
> wrote:
>
>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>
>> You have correctly identified your alternatives.
>>
>> You must put the jdbc driver into /lib instead. Setting only the
>> parent-first pattern shouldn't affect anything.
>> That is only relevant if something is in both /lib and the
>> user-jar, telling Flink to prioritize what is in lib.
>>
>>
>>
>> On 26/04/2022 15:35, John Smith wrote:
>>
>> So I put classloader.parent-first-patterns.additional:
>> "org.apache.ignite." in the task config and so far I don't think I'm
>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>
>> Or it's too early to tell.
>>
>> Though now, the task managers are shutting down due to some
>> other failures.
>>
>> So maybe because tasks were failing and reloading often the task
>> manager was running out of Metaspace. But now maybe it's just
>> cleanly shutting down.
>>
>> On Wed, Apr 20, 2022 at 11:35 AM John Smith 
>> wrote:
>>
>>> Or I can put in the config to treat org.apache.ignite. classes as
>>> first class?
>>>
>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith 
>>> wrote:
>>>
 Ok, so I loaded the dump into Eclipse Mat and followed:
 https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

 - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
 - Then I clicked on one of them "Merge Shortest Path..." and picked
 "Exclude all phantom/weak/soft references"
 - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
 Driver

 So i'm guessing anything JDBC based. I should copy into the task
 manager libs folder and my jobs make the dependencies as compile only?

 On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
 

Re: How to debug Metaspace exception?

2022-04-28 Thread John Smith
I assume you will take action on your side to track and fix the doc? :)

On Thu, Apr 28, 2022 at 11:12 AM John Smith  wrote:

> Ok so to summarize...
>
> - Build my job jar and have the JDBC driver as a compile only
> dependency and copy the JDBC driver to flink lib folder.
>
> Or
>
> - Build my job jar and include JDBC driver in the shadow, plus copy the
> JDBC driver in the flink lib folder, plus  make an entry in config for
> classloader.parent-first-patterns-additional
> 
>
>
> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler 
> wrote:
>
>> I think what I meant was "either add it to /lib, or [if it is already in
>> /lib but also bundled in the jar] add it to the parent-first patterns."
>>
>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>
>> Pretty sure, even though I seemingly documented it incorrectly :)
>>
>> On 28/04/2022 15:49, John Smith wrote:
>>
>> You sure?
>>
>>    - *JDBC*: JDBC drivers leak references outside the user code
>>      classloader. To ensure that these classes are only loaded once you should
>>      either add the driver jars to Flink’s lib/ folder, or add the driver
>>      classes to the list of parent-first loaded class via
>>      classloader.parent-first-patterns-additional.
>>
>>It says either or
>>
>>
>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler 
>> wrote:
>>
>>> You're misinterpreting the docs.
>>>
>>> The parent/child-first classloading controls where Flink looks for a
>>> class *first*, specifically whether we first load from /lib or the
>>> user-jar.
>>> It does not allow you to load something from the user-jar in the parent
>>> classloader. That's just not how it works.
>>>
>>> It must be in /lib.
>>>
>>> On 27/04/2022 04:59, John Smith wrote:
>>>
>>> Hi Chesnay as per the docs...
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>
>>> You can either put the jars in task manager lib folder or use
>>> classloader.parent-first-patterns-additional
>>> 
>>>
>>> I prefer the latter like this: the dependency stays with the user-jar
>>> and not on the task manager.
>>>
>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith 
>>> wrote:
>>>
 Ok so I should put the Apache ignite and my Microsoft drivers in the
 lib folders of my task managers?

 And then in my job jar only include them as compile time dependencies?


 On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler 
 wrote:

> JDBC drivers are well-known for leaking classloaders unfortunately.
>
> You have correctly identified your alternatives.
>
> You must put the jdbc driver into /lib instead. Setting only the
> parent-first pattern shouldn't affect anything.
> That is only relevant if something is both in /lib and the
> user-jar, telling Flink to prioritize what is in lib.
>
>
>
> On 26/04/2022 15:35, John Smith wrote:
>
> So I put classloader.parent-first-patterns.additional:
> "org.apache.ignite." in the task config and so far I don't think I'm
> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>
> Or it's too early to tell.
>
> Though now, the task managers are shutting down due to some
> other failures.
>
> So maybe because tasks were failing and reloading often the task
> manager was running out of Metaspace. But now maybe it's just
> cleanly shutting down.
>
> On Wed, Apr 20, 2022 at 11:35 AM John Smith 
> wrote:
>
>> Or I can put in the config to treat org.apache.ignite. classes as
>> first class?
>>
>> On Tue, Apr 19, 2022 at 10:18 PM John Smith 
>> wrote:
>>
>>> Ok, so I loaded the dump into Eclipse Mat and followed:
>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>> "Exclude all phantom/weak/soft references"
>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>> Driver
>>>
>>> So i'm guessing anything JDBC based. I should copy into the task
>>> manager libs folder and my jobs make the dependencies as compile only?
>>>
>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>> yaros...@goldsky.io> wrote:
>>>
 Also
 https://shopify.engineering/optimizing-apache-flink-applications-tips
 might be helpful (has a section on profiling, as well as classloading).

 On Tue, Apr 19, 2022 at 4:35 AM 

Re: How to debug Metaspace exception?

2022-04-28 Thread John Smith
Ok so to summarize...

- Build my job jar with the JDBC driver as a compile-only
dependency, and copy the JDBC driver to the Flink lib folder.

Or

- Build my job jar with the JDBC driver included in the shadow jar, copy the
JDBC driver to the Flink lib folder, and add an entry in the config for
classloader.parent-first-patterns.additional
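
One way to check which of the two setups is actually in effect is to log, from inside the job, the class loader that supplied the driver class. A minimal sketch in Java; the helper name LoaderCheck is made up for illustration, and the demo uses JDK classes only because no driver is on the classpath here:

```java
/** Hypothetical helper: reports which class loader defined a class. */
final class LoaderCheck {

    /** Returns the defining loader's class name, or "bootstrap" for JDK core classes. */
    static String describeLoader(Class<?> type) {
        ClassLoader loader = type.getClassLoader();
        // Core JDK classes report a null loader, meaning the bootstrap loader.
        return loader == null ? "bootstrap" : loader.getClass().getName();
    }

    public static void main(String[] args) {
        // A JDK class and an application class, just to show both cases.
        System.out.println(describeLoader(String.class));
        System.out.println(describeLoader(LoaderCheck.class));
    }
}
```

Inside a job you would pass the driver class instead. If the reported name ends in ChildFirstClassLoader (the class that showed up in the heap dump histogram), the driver was still loaded from the user-jar; if it is the JVM's application class loader, it presumably came from the lib folder.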



On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler 
wrote:

> I think what I meant was "either add it to /lib, or [if it is already in
> /lib but also bundled in the jar] add it to the parent-first patterns."
>
> On 28/04/2022 15:56, Chesnay Schepler wrote:
>
> Pretty sure, even though I seemingly documented it incorrectly :)
>
> On 28/04/2022 15:49, John Smith wrote:
>
> You sure?
>
> - *JDBC*: JDBC drivers leak references outside the user code
>   classloader. To ensure that these classes are only loaded once you should
>   either add the driver jars to Flink’s lib/ folder, or add the driver
>   classes to the list of parent-first loaded class via
>   classloader.parent-first-patterns-additional.
>
> It says either or
>
>
> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler 
> wrote:
>
>> You're misinterpreting the docs.
>>
>> The parent/child-first classloading controls where Flink looks for a
>> class *first*, specifically whether we first load from /lib or the
>> user-jar.
>> It does not allow you to load something from the user-jar in the parent
>> classloader. That's just not how it works.
>>
>> It must be in /lib.
>>
>> On 27/04/2022 04:59, John Smith wrote:
>>
>> Hi Chesnay as per the docs...
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>
>> You can either put the jars in task manager lib folder or use
>> classloader.parent-first-patterns-additional
>> 
>>
>> I prefer the latter like this: the dependency stays with the user-jar and
>> not on the task manager.
>>
>> On Tue, Apr 26, 2022 at 9:52 PM John Smith 
>> wrote:
>>
>>> Ok so I should put the Apache ignite and my Microsoft drivers in the lib
>>> folders of my task managers?
>>>
>>> And then in my job jar only include them as compile time dependencies?
>>>
>>>
>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler 
>>> wrote:
>>>
 JDBC drivers are well-known for leaking classloaders unfortunately.

 You have correctly identified your alternatives.

 You must put the jdbc driver into /lib instead. Setting only the
 parent-first pattern shouldn't affect anything.
 That is only relevant if something is both in /lib and the user-jar,
 telling Flink to prioritize what is in lib.



 On 26/04/2022 15:35, John Smith wrote:

 So I put classloader.parent-first-patterns.additional:
 "org.apache.ignite." in the task config and so far I don't think I'm
 getting "java.lang.OutOfMemoryError: Metaspace" any more.

 Or it's too early to tell.

 Though now, the task managers are shutting down due to some
 other failures.

 So maybe because tasks were failing and reloading often the task
 manager was running out of Metaspace. But now maybe it's just
 cleanly shutting down.

 On Wed, Apr 20, 2022 at 11:35 AM John Smith 
 wrote:

> Or I can put in the config to treat org.apache.ignite. classes as
> first class?
>
> On Tue, Apr 19, 2022 at 10:18 PM John Smith 
> wrote:
>
>> Ok, so I loaded the dump into Eclipse Mat and followed:
>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>
>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>> "Exclude all phantom/weak/soft references"
>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>> Driver
>>
>> So i'm guessing anything JDBC based. I should copy into the task
>> manager libs folder and my jobs make the dependencies as compile only?
>>
>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>> yaros...@goldsky.io> wrote:
>>
>>> Also
>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>> might be helpful (has a section on profiling, as well as classloading).
>>>
>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler 
>>> wrote:
>>>
 We have a very rough "guide" in the wiki (it's just the specific
 steps I took to debug another leak):

 https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

 

Re: How to debug Metaspace exception?

2022-04-28 Thread Chesnay Schepler
I think what I meant was "either add it to /lib, or [if it is already in 
/lib but also bundled in the jar] add it to the parent-first patterns."


On 28/04/2022 15:56, Chesnay Schepler wrote:

Pretty sure, even though I seemingly documented it incorrectly :)

On 28/04/2022 15:49, John Smith wrote:

You sure?

 * /JDBC/: JDBC drivers leak references outside the user code
   classloader. To ensure that these classes are only loaded once
   you should either add the driver jars to Flink’s |lib/| folder,
   or add the driver classes to the list of parent-first loaded
   class via |classloader.parent-first-patterns-additional|.

It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler  
wrote:


You're misinterpreting the docs.

The parent/child-first classloading controls where Flink looks
for a class /first/, specifically whether we first load from /lib
or the user-jar.
It does not allow you to load something from the user-jar in the
parent classloader. That's just not how it works.

It must be in /lib.

On 27/04/2022 04:59, John Smith wrote:

Hi Chesnay as per the docs...

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in task manager lib folder or use
|classloader.parent-first-patterns-additional|



I prefer the latter like this: the dependency stays with the
user-jar and not on the task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith
 wrote:

Ok so I should put the Apache ignite and my Microsoft
drivers in the lib folders of my task managers?

And then in my job jar only include them as compile time
dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
 wrote:

JDBC drivers are well-known for leaking classloaders
unfortunately.

You have correctly identified your alternatives.

You must put the jdbc driver into /lib instead. Setting
only the parent-first pattern shouldn't affect anything.
That is only relevant if something is both in /lib
and the user-jar, telling Flink to prioritize what is in
lib.



On 26/04/2022 15:35, John Smith wrote:

So I put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config and so far I
don't think I'm getting "java.lang.OutOfMemoryError:
Metaspace" any more.

Or it's too early to tell.

Though now, the task managers are shutting down due to
some other failures.

So maybe because tasks were failing and reloading often
the task manager was running out of Metaspace. But now
maybe it's just cleanly shutting down.

On Wed, Apr 20, 2022 at 11:35 AM John Smith
 wrote:

Or I can put in the config to treat
org.apache.ignite. classes as first class?

On Tue, Apr 19, 2022 at 10:18 PM John Smith
 wrote:

Ok, so I loaded the dump into Eclipse Mat and
followed:

https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

- On the Histogram, I got over 30 entries for:
ChildFirstClassLoader
- Then I clicked on one of them "Merge Shortest
Path..." and picked "Exclude all
phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache
Ignite JdbcThin Driver

So i'm guessing anything JDBC based. I should
copy into the task manager libs folder and my
jobs make the dependencies as compile only?

On Tue, Apr 19, 2022 at 12:18 PM Yaroslav
Tkachenko  wrote:

Also

https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on
profiling, as well as classloading).

On Tue, Apr 19, 2022 at 4:35 AM Chesnay
Schepler  wrote:

We have a very rough "guide" in the
wiki (it's just the specific steps I
took to debug another leak):

https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

On 19/04/2022 12:01, huweihua 

Re: How to debug Metaspace exception?

2022-04-28 Thread Chesnay Schepler

Pretty sure, even though I seemingly documented it incorrectly :)

On 28/04/2022 15:49, John Smith wrote:

You sure?

 * /JDBC/: JDBC drivers leak references outside the user code
   classloader. To ensure that these classes are only loaded once you
   should either add the driver jars to Flink’s |lib/| folder, or add
   the driver classes to the list of parent-first loaded class via
   |classloader.parent-first-patterns-additional|.

It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler  
wrote:


You're misinterpreting the docs.

The parent/child-first classloading controls where Flink looks for
a class /first/, specifically whether we first load from /lib or
the user-jar.
It does not allow you to load something from the user-jar in the
parent classloader. That's just not how it works.

It must be in /lib.

On 27/04/2022 04:59, John Smith wrote:

Hi Chesnay as per the docs...

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in task manager lib folder or use
|classloader.parent-first-patterns-additional|



I prefer the latter like this: the dependency stays with the
user-jar and not on the task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith
 wrote:

Ok so I should put the Apache ignite and my Microsoft drivers
in the lib folders of my task managers?

And then in my job jar only include them as compile time
dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
 wrote:

JDBC drivers are well-known for leaking classloaders
unfortunately.

You have correctly identified your alternatives.

You must put the jdbc driver into /lib instead. Setting
only the parent-first pattern shouldn't affect anything.
That is only relevant if something is both in /lib and
the user-jar, telling Flink to prioritize what is in lib.



On 26/04/2022 15:35, John Smith wrote:

So I put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config and so far I
don't think I'm getting "java.lang.OutOfMemoryError:
Metaspace" any more.

Or it's too early to tell.

Though now, the task managers are shutting down due to
some other failures.

So maybe because tasks were failing and reloading often
the task manager was running out of Metaspace. But now
maybe it's just cleanly shutting down.

On Wed, Apr 20, 2022 at 11:35 AM John Smith
 wrote:

Or I can put in the config to treat
org.apache.ignite. classes as first class?

On Tue, Apr 19, 2022 at 10:18 PM John Smith
 wrote:

Ok, so I loaded the dump into Eclipse Mat and
followed:

https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

- On the Histogram, I got over 30 entries for:
ChildFirstClassLoader
- Then I clicked on one of them "Merge Shortest
Path..." and picked "Exclude all
phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache
Ignite JdbcThin Driver

So i'm guessing anything JDBC based. I should
copy into the task manager libs folder and my
jobs make the dependencies as compile only?

On Tue, Apr 19, 2022 at 12:18 PM Yaroslav
Tkachenko  wrote:

Also

https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on
profiling, as well as classloading).

On Tue, Apr 19, 2022 at 4:35 AM Chesnay
Schepler  wrote:

We have a very rough "guide" in the wiki
(it's just the specific steps I took to
debug another leak):

https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

On 19/04/2022 12:01, huweihua wrote:

Hi, John

Sorry for the late reply. You can use
MAT[1] to analyze the dump file. Check
   

Re: How to debug Metaspace exception?

2022-04-28 Thread John Smith
You sure?

   - *JDBC*: JDBC drivers leak references outside the user code classloader.
     To ensure that these classes are only loaded once you should either add the
     driver jars to Flink’s lib/ folder, or add the driver classes to the
     list of parent-first loaded class via
     classloader.parent-first-patterns-additional.

It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler  wrote:

> You're misinterpreting the docs.
>
> The parent/child-first classloading controls where Flink looks for a class
> *first*, specifically whether we first load from /lib or the user-jar.
> It does not allow you to load something from the user-jar in the parent
> classloader. That's just not how it works.
>
> It must be in /lib.
>
> On 27/04/2022 04:59, John Smith wrote:
>
> Hi Chesnay as per the docs...
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>
> You can either put the jars in task manager lib folder or use
> classloader.parent-first-patterns-additional
> 
>
> I prefer the latter like this: the dependency stays with the user-jar and
> not on the task manager.
>
> On Tue, Apr 26, 2022 at 9:52 PM John Smith  wrote:
>
>> Ok so I should put the Apache ignite and my Microsoft drivers in the lib
>> folders of my task managers?
>>
>> And then in my job jar only include them as compile time dependencies?
>>
>>
>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler 
>> wrote:
>>
>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>
>>> You have correctly identified your alternatives.
>>>
>>> You must put the jdbc driver into /lib instead. Setting only the
>>> parent-first pattern shouldn't affect anything.
>>> That is only relevant if something is both in /lib and the user-jar,
>>> telling Flink to prioritize what is in lib.
>>>
>>>
>>>
>>> On 26/04/2022 15:35, John Smith wrote:
>>>
>>> So I put classloader.parent-first-patterns.additional:
>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>
>>> Or it's too early to tell.
>>>
>>> Though now, the task managers are shutting down due to some
>>> other failures.
>>>
>>> So maybe because tasks were failing and reloading often the task manager
>>> was running out of Metaspace. But now maybe it's just cleanly shutting down.
>>>
>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith 
>>> wrote:
>>>
 Or I can put in the config to treat org.apache.ignite. classes as first
 class?

 On Tue, Apr 19, 2022 at 10:18 PM John Smith 
 wrote:

> Ok, so I loaded the dump into Eclipse Mat and followed:
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
> - Then I clicked on one of them "Merge Shortest Path..." and picked
> "Exclude all phantom/weak/soft references"
> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>
> So i'm guessing anything JDBC based. I should copy into the task
> manager libs folder and my jobs make the dependencies as compile only?
>
> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
> yaros...@goldsky.io> wrote:
>
>> Also
>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>> might be helpful (has a section on profiling, as well as classloading).
>>
>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler 
>> wrote:
>>
>>> We have a very rough "guide" in the wiki (it's just the specific
>>> steps I took to debug another leak):
>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>> On 19/04/2022 12:01, huweihua wrote:
>>>
>>> Hi, John
>>>
>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>> file. Check whether there are too many loaded classes.
>>>
>>> [1] https://www.eclipse.org/mat/
>>>
>>> On Apr 18, 2022, at 9:55 PM, John Smith wrote:
>>>
>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>
>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith 
>>> wrote:
>>>
 Hi, so I have a dump file. What do I look for?

 On Thu, Mar 31, 2022 at 3:28 PM John Smith 
 wrote:

> Ok so if there's a leak, if I manually stop the job and restart it
> from the UI multiple times, I won't see the issue because the
> classes are unloaded correctly?
>
>
> On Thu, Mar 31, 2022 at 9:20 AM huweihua 
> wrote:
>
>>
>> The difference is that manually canceling 

Re: How to debug Metaspace exception?

2022-04-27 Thread Chesnay Schepler

You're misinterpreting the docs.

The parent/child-first classloading controls where Flink looks for a 
class /first/, specifically whether we first load from /lib or the user-jar.
It does not allow you to load something from the user-jar in the parent 
classloader. That's just not how it works.


It must be in /lib.
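
To make the lookup order concrete, here is a minimal sketch in Java. It is a simplified model of the decision only, not Flink's actual classloader code; the names LookupOrderSketch, isParentFirst and resolve are made up for illustration, and the two booleans stand in for "the class file exists in /lib" and "the class file exists in the user-jar".

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the lookup order; NOT Flink's actual classloader code. */
final class LookupOrderSketch {

    /** True when the class name starts with one of the parent-first prefixes. */
    static boolean isParentFirst(String className, List<String> parentFirstPatterns) {
        for (String prefix : parentFirstPatterns) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Says which location would supply the class: "lib", "user-jar" or "not found".
     * inLib and inUserJar state where the class file physically exists.
     */
    static String resolve(String className, List<String> patterns,
                          boolean inLib, boolean inUserJar) {
        if (isParentFirst(className, patterns)) {
            // Parent-first: /lib is asked first, the user-jar is only the fallback.
            if (inLib) {
                return "lib";
            }
            return inUserJar ? "user-jar" : "not found";
        }
        // Child-first (the default): the user-jar is asked first.
        if (inUserJar) {
            return "user-jar";
        }
        return inLib ? "lib" : "not found";
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("org.apache.ignite.");
        String driver = "org.apache.ignite.IgniteJdbcThinDriver";
        // Pattern set, driver only in the user-jar.
        System.out.println(resolve(driver, patterns, false, true));
        // Pattern set, driver in both places.
        System.out.println(resolve(driver, patterns, true, true));
    }
}
```

With the pattern set but the driver present only in the user-jar, the model still answers "user-jar": the pattern changes the order of the lookup, not where the class can come from.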

On 27/04/2022 04:59, John Smith wrote:
Hi Chesnay as per the docs... 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/


You can either put the jars in task manager lib folder or use 
|classloader.parent-first-patterns-additional| 



I prefer the latter like this: the dependency stays with the user-jar 
and not on the task manager.


On Tue, Apr 26, 2022 at 9:52 PM John Smith  wrote:

Ok so I should put the Apache ignite and my Microsoft drivers in
the lib folders of my task managers?

And then in my job jar only include them as compile time
dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
 wrote:

JDBC drivers are well-known for leaking classloaders
unfortunately.

You have correctly identified your alternatives.

You must put the jdbc driver into /lib instead. Setting only
the parent-first pattern shouldn't affect anything.
That is only relevant if something is both in /lib and the
user-jar, telling Flink to prioritize what is in lib.



On 26/04/2022 15:35, John Smith wrote:

So I put classloader.parent-first-patterns.additional:
"org.apache.ignite." in the task config and so far I don't
think I'm getting "java.lang.OutOfMemoryError: Metaspace" any
more.

Or it's too early to tell.

Though now, the task managers are shutting down due to some
other failures.

So maybe because tasks were failing and reloading often the
task manager was running out of Metaspace. But now maybe it's
just cleanly shutting down.

On Wed, Apr 20, 2022 at 11:35 AM John Smith
 wrote:

Or I can put in the config to treat org.apache.ignite.
classes as first class?

On Tue, Apr 19, 2022 at 10:18 PM John Smith
 wrote:

Ok, so I loaded the dump into Eclipse Mat and
followed:

https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

- On the Histogram, I got over 30 entries for:
ChildFirstClassLoader
- Then I clicked on one of them "Merge Shortest
Path..." and picked "Exclude all phantom/weak/soft
references"
- Which then gave me: SqlDriverManager > Apache
Ignite JdbcThin Driver

So i'm guessing anything JDBC based. I should copy
into the task manager libs folder and my jobs make
the dependencies as compile only?

On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko
 wrote:

Also

https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on profiling, as
well as classloading).

On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler
 wrote:

We have a very rough "guide" in the wiki
(it's just the specific steps I took to debug
another leak):

https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

On 19/04/2022 12:01, huweihua wrote:

Hi, John

Sorry for the late reply. You can use MAT[1]
to analyze the dump file. Check whether there are
too many loaded classes.

[1] https://www.eclipse.org/mat/


On Apr 18, 2022, at 9:55 PM, John Smith
 wrote:

Hi, can anyone help with this? I never
looked at a dump file before.

On Thu, Apr 14, 2022 at 11:59 AM John Smith
 wrote:

Hi, so I have a dump file. What do I
look for?

On Thu, Mar 31, 2022 at 3:28 PM John
Smith  wrote:

Ok so if there's a leak, if I
manually stop the job and restart
it from the UI multiple times, I
won't see the issue because
the classes are unloaded correctly?


  

Re: How to debug Metaspace exception?

2022-04-26 Thread John Smith
Hi Chesnay as per the docs...
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in the task manager lib folder or use
classloader.parent-first-patterns-additional

I prefer the latter: that way the dependency stays with the user-jar and
not on the task manager.

On Tue, Apr 26, 2022 at 9:52 PM John Smith  wrote:

> Ok so I should put the Apache ignite and my Microsoft drivers in the lib
> folders of my task managers?
>
> And then in my job jar only include them as compile time dependencies?
>
>
> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler 
> wrote:
>
>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>
>> You have correctly identified your alternatives.
>>
>> You must put the jdbc driver into /lib instead. Setting only the
>> parent-first pattern shouldn't affect anything.
>> That is only relevant if something is both in /lib and the user-jar,
>> telling Flink to prioritize what is in lib.
>>
>>
>>
>> On 26/04/2022 15:35, John Smith wrote:
>>
>> So I put classloader.parent-first-patterns.additional:
>> "org.apache.ignite." in the task config and so far I don't think I'm
>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>
>> Or it's too early to tell.
>>
>> Though now, the task managers are shutting down due to some
>> other failures.
>>
>> So maybe because tasks were failing and reloading often the task manager
>> was running out of Metaspace. But now maybe it's just cleanly shutting down.
>>
>> On Wed, Apr 20, 2022 at 11:35 AM John Smith 
>> wrote:
>>
>>> Or I can put in the config to treat org.apache.ignite. classes as first
>>> class?
>>>
>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith 
>>> wrote:
>>>
 Ok, so I loaded the dump into Eclipse Mat and followed:
 https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

 - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
 - Then I clicked on one of them "Merge Shortest Path..." and picked
 "Exclude all phantom/weak/soft references"
 - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver

 So i'm guessing anything JDBC based. I should copy into the task
 manager libs folder and my jobs make the dependencies as compile only?

 On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
 yaros...@goldsky.io> wrote:

> Also
> https://shopify.engineering/optimizing-apache-flink-applications-tips
> might be helpful (has a section on profiling, as well as classloading).
>
> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler 
> wrote:
>
>> We have a very rough "guide" in the wiki (it's just the specific
>> steps I took to debug another leak):
>>
>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>
>> On 19/04/2022 12:01, huweihua wrote:
>>
>> Hi, John
>>
>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>> file. Check whether there are too many loaded classes.
>>
>> [1] https://www.eclipse.org/mat/
>>
>> On Apr 18, 2022, at 9:55 PM, John Smith wrote:
>>
>> Hi, can anyone help with this? I never looked at a dump file before.
>>
>> On Thu, Apr 14, 2022 at 11:59 AM John Smith 
>> wrote:
>>
>>> Hi, so I have a dump file. What do I look for?
>>>
>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith 
>>> wrote:
>>>
 Ok so if there's a leak, if I manually stop the job and restart it
 from the UI multiple times, I won't see the issue because the
 classes are unloaded correctly?


 On Thu, Mar 31, 2022 at 9:20 AM huweihua 
 wrote:

>
> The difference is that manually canceling the job stops the
> JobMaster, but automatic failover keeps the JobMaster running. But
> looking at the TaskManager, it doesn't make much difference.
>
>
> On Mar 31, 2022, at 4:01 AM, John Smith wrote:
>
> Also if I manually cancel and restart the same job over and over
> is it the same as if flink was restarting a job due to failure?
>
> I.e: When I click "Cancel Job" on the UI is the job completely
> unloaded vs when the job scheduler restarts a job because of whatever
> reason?
>
> Like this I'll stop and restart the job a few times or maybe I can
> trick my job to fail and have the scheduler restart it. Ok let me
> think about this...
>
> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 
> wrote:
>
>> So if I run the same jobs in my dev env will I still be able to
>> see the similar dump?
>>
>> I think running the same job in dev should be 

Re: How to debug Metaspace exception?

2022-04-26 Thread John Smith
Ok so I should put the Apache Ignite and my Microsoft drivers in the lib
folders of my task managers?

And then in my job jar only include them as compile time dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler 
wrote:

> JDBC drivers are well-known for leaking classloaders unfortunately.
>
> You have correctly identified your alternatives.
>
> You must put the jdbc driver into /lib instead. Setting only the
> parent-first pattern shouldn't affect anything.
> That is only relevant if something is both in /lib and the user-jar,
> telling Flink to prioritize what is in lib.
>
>
>
> On 26/04/2022 15:35, John Smith wrote:
>
> So I put classloader.parent-first-patterns.additional:
> "org.apache.ignite." in the task config and so far I don't think I'm
> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>
> Or it's too early to tell.
>
> Though now, the task managers are shutting down due to some other failures.
>
> So maybe because tasks were failing and reloading often the task manager
> was running out of Metaspace. But now maybe it's just cleanly shutting down.
>
> On Wed, Apr 20, 2022 at 11:35 AM John Smith 
> wrote:
>
>> Or I can put in the config to treat org.apache.ignite. classes as first
>> class?
>>
>> On Tue, Apr 19, 2022 at 10:18 PM John Smith 
>> wrote:
>>
>>> Ok, so I loaded the dump into Eclipse Mat and followed:
>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>> "Exclude all phantom/weak/soft references"
>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>
>>> So i'm guessing anything JDBC based. I should copy into the task manager
>>> libs folder and my jobs make the dependencies as compile only?
>>>
>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko 
>>> wrote:
>>>
 Also
 https://shopify.engineering/optimizing-apache-flink-applications-tips
 might be helpful (has a section on profiling, as well as classloading).

 On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler 
 wrote:

> We have a very rough "guide" in the wiki (it's just the specific steps
> I took to debug another leak):
>
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> On 19/04/2022 12:01, huweihua wrote:
>
> Hi, John
>
> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
> Check whether there are too many loaded classes.
>
> [1] https://www.eclipse.org/mat/
>
> On Apr 18, 2022, at 9:55 PM, John Smith wrote:
>
> Hi, can anyone help with this? I never looked at a dump file before.
>
> On Thu, Apr 14, 2022 at 11:59 AM John Smith 
> wrote:
>
>> Hi, so I have a dump file. What do I look for?
>>
>> On Thu, Mar 31, 2022 at 3:28 PM John Smith 
>> wrote:
>>
>>> Ok so if there's a leak, if I manually stop the job and restart it
>>> from the UI multiple times, I won't see the issue because the
>>> classes are unloaded correctly?
>>>
>>>
>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua 
>>> wrote:
>>>

 The difference is that manually canceling the job stops the
 JobMaster, but automatic failover keeps the JobMaster running. But
 looking at the TaskManager, it doesn't make much difference.


 On Mar 31, 2022, at 4:01 AM, John Smith wrote:

 Also if I manually cancel and restart the same job over and over is
 it the same as if flink was restarting a job due to failure?

 I.e: When I click "Cancel Job" on the UI is the job completely
 unloaded vs when the job scheduler restarts a job because of whatever
 reason?

 Like this I'll stop and restart the job a few times or maybe I can
 trick my job to fail and have the scheduler restart it. Ok let me think
 about this...

 On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 
 wrote:

> So if I run the same jobs in my dev env will I still be able to
> see the similar dump?
>
> I think running the same job in dev should be reproducible, maybe
> you can have a try.
>
>  If not I would have to wait at a low volume time to do it on
> production. Also, if I recall, the dump is as big as the JVM memory, right? So
> if I have 10GB configured for the JVM, the dump will be a 10GB file?
>
> Yes, jmap will pause the JVM; the length of the pause depends on the
> size of the dump. You can use "jmap -dump:live" to dump only the reachable
> objects; this will take a brief pause.
>
>
>
> On Mar 30, 2022, at 9:47 PM, John Smith wrote:
>
> I have 3 task managers (see config below). There is total of 10

Re: How to debug Metaspace exception?

2022-04-26 Thread Chesnay Schepler

JDBC drivers are unfortunately well known for leaking classloaders.

You have correctly identified your alternatives.

You must put the JDBC driver into /lib instead. Setting only the
parent-first pattern shouldn't affect anything.
That setting is only relevant if something is in both /lib and the user-jar;
it tells Flink to prioritize what is in /lib.
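
For context, the leak mechanism can be sketched in plain Java. java.sql.DriverManager is loaded by a JVM-level classloader and keeps a static registry of driver instances, so a driver class loaded by a job's user-code classloader keeps that classloader reachable after the job stops. The sketch below is only an illustration, not something Flink does for you: JdbcDriverCleanup and DummyDriver are hypothetical names, and the dummy driver stands in for a real one such as the Ignite thin driver. It deregisters every driver that was loaded by a given classloader, which is one possible cleanup if the driver has to stay in the user-jar.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.util.Collections;
import java.util.Properties;
import java.util.logging.Logger;

class JdbcDriverCleanup {

    /** Deregisters every JDBC driver whose class was loaded by the given classloader. */
    static int deregisterDriversLoadedBy(ClassLoader loader) {
        int removed = 0;
        // Copy the enumeration first so the registry is not modified while it is being read.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == loader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed++;
                } catch (SQLException e) {
                    throw new IllegalStateException("Could not deregister " + driver, e);
                }
            }
        }
        return removed;
    }

    /** Registers the hypothetical dummy driver, the way a real driver registers itself on load. */
    static void registerDummyDriver() {
        try {
            DriverManager.registerDriver(new DummyDriver());
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical stand-in for a real JDBC driver; it never opens a connection. */
    static final class DummyDriver implements Driver {
        public Connection connect(String url, Properties info) { return null; }
        public boolean acceptsURL(String url) { return false; }
        public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
            return new DriverPropertyInfo[0];
        }
        public int getMajorVersion() { return 1; }
        public int getMinorVersion() { return 0; }
        public boolean jdbcCompliant() { return false; }
        public Logger getParentLogger() { return Logger.getGlobal(); }
    }

    public static void main(String[] args) {
        registerDummyDriver();
        int removed = deregisterDriversLoadedBy(JdbcDriverCleanup.class.getClassLoader());
        System.out.println("Deregistered drivers: " + removed);
    }
}
```

Note that DriverManager.getDrivers() only returns drivers visible to the caller's classloader, so a cleanup like this would have to run from the job's own code, for example when the job shuts down. Putting the driver into /lib, as described above, avoids the problem altogether.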




On 26/04/2022 15:35, John Smith wrote:
> So I put classloader.parent-first-patterns.additional:
> "org.apache.ignite." in the task config and so far I don't think I'm
> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>
> Or it's too early to tell.
>
> Though now, the task managers are shutting down due to some
> other failures.
>
> So maybe because tasks were failing and reloading often, the task
> manager was running out of Metaspace. But now maybe it's just
> cleanly shutting down.



Re: How to debug Metaspace exception?

2022-04-26 Thread John Smith
So I put classloader.parent-first-patterns.additional: "org.apache.ignite."
in the task config and so far I don't think I'm getting
"java.lang.OutOfMemoryError: Metaspace" any more.

Or it's too early to tell.

Though now, the task managers are shutting down due to some other failures.

So maybe because tasks were failing and reloading often, the task manager
was running out of Metaspace. But now maybe it's just cleanly shutting down.

On Wed, Apr 20, 2022 at 11:35 AM John Smith  wrote:

> Or I can put in the config to treat org.apache.ignite. classes as first
> class?

Re: How to debug Metaspace exception?

2022-04-20 Thread John Smith
Or I can put in the config to treat org.apache.ignite. classes as first
class?

On Tue, Apr 19, 2022 at 10:18 PM John Smith  wrote:

> Ok, so I loaded the dump into Eclipse MAT and followed:
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
> - Then I clicked on one of them "Merge Shortest Path..." and picked
> "Exclude all phantom/weak/soft references"
> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>
> So I'm guessing this applies to anything JDBC based. Should I copy the
> drivers into the task manager lib folder and mark the dependencies as
> compile-only in my jobs?

Re: How to debug Metaspace exception?

2022-04-19 Thread John Smith
Ok, so I loaded the dump into Eclipse MAT and followed:
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

- On the Histogram, I got over 30 entries for: ChildFirstClassLoader
- Then I clicked on one of them "Merge Shortest Path..." and picked
"Exclude all phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver

So I'm guessing this applies to anything JDBC based. Should I copy the
drivers into the task manager lib folder and mark the dependencies as
compile-only in my jobs?

On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko 
wrote:

> Also https://shopify.engineering/optimizing-apache-flink-applications-tips
> might be helpful (has a section on profiling, as well as classloading).

Re: How to debug Metaspace exception?

2022-04-19 Thread Yaroslav Tkachenko
Also https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on profiling, as well as classloading).

On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler  wrote:

> We have a very rough "guide" in the wiki (it's just the specific steps I
> took to debug another leak):
>
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

Re: How to debug Metaspace exception?

2022-04-19 Thread Chesnay Schepler
We have a very rough "guide" in the wiki (it's just the specific steps I 
took to debug another leak):

https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
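
The guide starts from a heap dump. Besides jmap, which is mentioned elsewhere in this thread, a HotSpot JVM can also write a dump of itself through HotSpotDiagnosticMXBean. The following is only a sketch under that assumption: LiveHeapDump and the file name are made up for illustration, and passing true as the second argument of dumpHeap restricts the dump to reachable objects, similar to "jmap -dump:live".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

class LiveHeapDump {

    /** Writes an .hprof dump of the current JVM; liveOnly=true keeps only reachable objects. */
    static void dump(Path target, boolean liveOnly) {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // dumpHeap refuses to overwrite an existing file, so the target must be new.
            diagnostics.dumpHeap(target.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Dumps into a fresh temporary directory and returns the size of the dump in bytes. */
    static long dumpToTempDirectory() {
        try {
            Path dir = Files.createTempDirectory("heap-dump");
            Path target = dir.resolve("taskmanager.hprof"); // hypothetical file name
            dump(target, true);
            return Files.size(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Wrote " + dumpToTempDirectory() + " bytes");
    }
}
```

The resulting .hprof file can be opened in MAT. As with jmap, the JVM is paused while the dump is written.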

On 19/04/2022 12:01, huweihua wrote:
> Hi, John
>
> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
> Check whether there are too many loaded classes.
>
> [1] https://www.eclipse.org/mat/



Re: How to debug Metaspace exception?

2022-04-19 Thread huweihua
Hi, John

Sorry for the late reply. You can use MAT[1] to analyze the dump file. Check
whether there are too many loaded classes.

[1] https://www.eclipse.org/mat/
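
A quick way to watch the class count without a heap dump is the JVM's standard management beans. This is a minimal sketch, assuming a HotSpot JVM where the memory pool is named "Metaspace"; MetaspaceSnapshot is a hypothetical helper name, and the code reports on whichever JVM it runs in, so it would have to be executed inside the task manager process (or adapted to a remote JMX connection) to be useful here.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

class MetaspaceSnapshot {

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Bytes used by the "Metaspace" pool, or -1 if this JVM exposes no pool with that name. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.println("loaded now:     " + classes.getLoadedClassCount());
        System.out.println("loaded total:   " + classes.getTotalLoadedClassCount());
        System.out.println("unloaded total: " + classes.getUnloadedClassCount());
        System.out.println("metaspace used: " + metaspaceUsedBytes() + " bytes");
    }
}
```

If the loaded count keeps growing across job restarts while the unloaded count stays flat, that would suggest a classloader leak rather than a metaspace that is simply too small.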

> On April 18, 2022, at 9:55 PM, John Smith wrote:
> 
> Hi, can anyone help with this? I never looked at a dump file before.
> 
> On Thu, Apr 14, 2022 at 11:59 AM John Smith wrote:
> Hi, so I have a dump file. What do I look for?
> 
> On Thu, Mar 31, 2022 at 3:28 PM John Smith wrote:
> Ok so if there's a leak, if I manually stop the job and restart it from the
> UI multiple times, I won't see the issue because the classes are
> unloaded correctly?
> 
> 
> On Thu, Mar 31, 2022 at 9:20 AM huweihua wrote:
>
> The difference is that manually canceling the job stops the JobMaster, but
> automatic failover keeps the JobMaster running. From the TaskManager's point
> of view, though, it doesn't make much difference.
> 
> 
>> On March 31, 2022, at 4:01 AM, John Smith wrote:
>>
>> Also if I manually cancel and restart the same job over and over, is it the
>> same as if Flink was restarting a job due to failure?
>>
>> I.e.: When I click "Cancel Job" on the UI, is the job completely unloaded vs.
>> when the job scheduler restarts a job for whatever reason?
>>
>> Like this I'll stop and restart the job a few times or maybe I can trick my
>> job to fail and have the scheduler restart it. Ok let me think about this...
>>
>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 wrote:
>>> So if I run the same jobs in my dev env will I still be able to see the 
>>> similar dump? 
>> I think running the same job in dev should be reproducible, maybe you can 
>> have a try.
>> 
>>>  If not I would have to wait for a low-volume time to do it on production.
>>> Also, if I recall, the dump is as big as the JVM memory, right? So if I have
>>> 10GB configured for the JVM the dump will be a 10GB file?
>>
>> Yes, jmap will pause the JVM; the length of the pause depends on the size of
>> the dump. You can use "jmap -dump:live" to dump only the reachable objects;
>> this will only take a brief pause.
>> 
>> 
>> 
>>> On March 30, 2022, at 9:47 PM, John Smith wrote:
>>>
>>> I have 3 task managers (see config below). There is a total of 10 jobs with
>>> 25 slots being used.
>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to
>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>
>>> For jmap: I know that it will pause the task manager. So if I run the same
>>> jobs in my dev env will I still be able to see a similar dump? I assume
>>> so. If not I would have to wait for a low-volume time to do it on
>>> production. Also, if I recall, the dump is as big as the JVM memory, right?
>>> So if I have 10GB configured for the JVM the dump will be a 10GB file?
>>> 
>>> 
>>> # Operating system has 16GB total.
>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>> 
>>> cluster.evenly-spread-out-slots: true
>>> 
>>> taskmanager.memory.flink.size: 10240m
>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>> taskmanager.numberOfTaskSlots: 16
>>> parallelism.default: 1
>>> 
>>> high-availability: zookeeper
>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>> high-availability.zookeeper.quorum: ...
>>> high-availability.zookeeper.path.root: /flink_1_14
>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>> 
>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>> 
>>> state.backend: rocksdb
>>> state.backend.incremental: true
>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>
>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 wrote:
>>> Hi, John
>>> 
>>> Could you tell us your application scenario? Is it a Flink session cluster
>>> with a lot of jobs?
>>> 
>>> Maybe you can try to dump the memory with jmap and use tools such as MAT to 
>>> analyze whether there are abnormal classes and classloaders
>>> 
>>> 
>>> > On March 30, 2022, at 6:09 AM, John Smith wrote:
>>> > 
>>> > Hi running 1.14.4
>>> > 
>>> > My task manager still fails with java.lang.OutOfMemoryError: Metaspace.
>>> > The metaspace out-of-memory error has occurred. This can mean two things: 
>>> > either the job requires a larger size of JVM metaspace to load classes or 
>>> > there is a class loading leak.
>>> > 
>>> > I have 2GB of metaspace configured
>>> > (taskmanager.memory.jvm-metaspace.size: 2048m).
>>> > 
>>> > But the task nodes still fail.
>>> > 
>>> > When looking at the UI metrics, the metaspace starts low. Now I see 85% 
>>> > usage. It seems to be a class loading leak at this point, how can we 
>>> > debug this issue?
>>> 
>> 
> 



Re: How to debug Metaspace exception?

2022-04-18 Thread John Smith
Hi, can anyone help with this? I never looked at a dump file before.

On Thu, Apr 14, 2022 at 11:59 AM John Smith  wrote:

> Hi, so I have a dump file. What do I look for?
>
> On Thu, Mar 31, 2022 at 3:28 PM John Smith  wrote:
>
>> Ok so if there's a leak, if I manually stop the job and restart it from
>> the UI multiple times, I won't see the issue because the classes
>> are unloaded correctly?
>>
>>
>> On Thu, Mar 31, 2022 at 9:20 AM huweihua  wrote:
>>
>>>
>>> The difference is that manually canceling the job stops the JobMaster,
>>> but automatic failover keeps the JobMaster running. From the
>>> TaskManager's point of view, though, it doesn't make much difference.
>>>
>>>
>>> On March 31, 2022, at 4:01 AM, John Smith wrote:
>>>
>>> Also if I manually cancel and restart the same job over and over, is it
>>> the same as if Flink was restarting a job due to failure?
>>>
>>> I.e.: When I click "Cancel Job" on the UI, is the job completely unloaded
>>> vs. when the job scheduler restarts a job for whatever reason?
>>>
>>> Like this I'll stop and restart the job a few times or maybe I can trick
>>> my job to fail and have the scheduler restart it. Ok let me think about
>>> this...
>>>
>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华  wrote:
>>>
 So if I run the same jobs in my dev env will I still be able to see the
 similar dump?

 I think running the same job in dev should be reproducible, maybe you
 can have a try.

  If not I would have to wait at a low volume time to do it on
 production. Aldo if I recall the dump is as big as the JVM memory right so
 if I have 10GB configed for the JVM the dump will be 10GB file?

 Yes, JMAP will pause the JVM, the time of pause depends on the size to
 dump. you can use "jmap -dump:live" to dump only the reachable objects,
 this will take a brief pause



 2022年3月30日 下午9:47,John Smith  写道:

 I have 3 task managers (see config below). There is total of 10 jobs
 with 25 slots being used.
 The jobs are 100% ETL I.e; They load Json, transform it and push it to
 JDBC, only 1 job of the 10 is pushing to Apache Ignite cluster.

 FOR JMAP. I know that it will pause the task manager. So if I run the
 same jobs in my dev env will I still be able to see the similar dump? I I
 assume so. If not I would have to wait at a low volume time to do it on
 production. Aldo if I recall the dump is as big as the JVM memory right so
 if I have 10GB configed for the JVM the dump will be 10GB file?


 # Operating system has 16GB total.
 env.ssh.opts: -l flink -oStrictHostKeyChecking=no

 cluster.evenly-spread-out-slots: true

 taskmanager.memory.flink.size: 10240m
 taskmanager.memory.jvm-metaspace.size: 2048m
 taskmanager.numberOfTaskSlots: 16
 parallelism.default: 1

 high-availability: zookeeper
 high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
 high-availability.zookeeper.quorum: ...
 high-availability.zookeeper.path.root: /flink_1_14
 high-availability.cluster-id: /flink_1_14_cluster_0001

 web.upload.dir: /mnt/flink/uploads/flink_1_14

 state.backend: rocksdb
 state.backend.incremental: true
 state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
 state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14

 On Wed, Mar 30, 2022 at 2:16 AM 胡伟华  wrote:

> Hi, John
>
> Could you tell us you application scenario? Is it a flink session
> cluster with a lot of jobs?
>
> Maybe you can try to dump the memory with jmap and use tools such as
> MAT to analyze whether there are abnormal classes and classloaders
>
>
> > 2022年3月30日 上午6:09,John Smith  写道:
> >
> > Hi running 1.14.4
> >
> > My tasks manager still fails with java.lang.OutOfMemoryError:
> Metaspace. The metaspace out-of-memory error has occurred. This can mean
> two things: either the job requires a larger size of JVM metaspace to load
> classes or there is a class loading leak.
> >
> > I have 2GB of metaspace configed
> taskmanager.memory.jvm-metaspace.size: 2048m
> >
> > But the task nodes still fail.
> >
> > When looking at the UI metrics, the metaspace starts low. Now I see
> 85% usage. It seems to be a class loading leak at this point, how can we
> debug this issue?
>
>

>>>


Re: How to debug Metaspace exception?

2022-04-14 Thread John Smith
Hi, so I have a dump file. What do I look for?

On Thu, Mar 31, 2022 at 3:28 PM John Smith  wrote:

> Ok, so if there's a leak and I manually stop the job and restart it from
> the UI multiple times, I won't see the issue because the classes
> are unloaded correctly?


Re: How to debug Metaspace exception?

2022-03-31 Thread John Smith
Ok, so if there's a leak and I manually stop the job and restart it from the
UI multiple times, I won't see the issue because the classes are
unloaded correctly?


On Thu, Mar 31, 2022 at 9:20 AM huweihua  wrote:

>
> The difference is that manually canceling the job stops the JobMaster, while
> automatic failover keeps the JobMaster running. From the TaskManager's point
> of view, though, it doesn't make much difference.


Re: How to debug Metaspace exception?

2022-03-31 Thread huweihua

The difference is that manually canceling the job stops the JobMaster, while
automatic failover keeps the JobMaster running. From the TaskManager's point of
view, though, it doesn't make much difference.


> On Mar 31, 2022, at 4:01 AM, John Smith wrote:
>
> Also, if I manually cancel and restart the same job over and over, is that the
> same as Flink restarting a job due to a failure?
>
> I.e., when I click "Cancel Job" in the UI, is the job completely unloaded,
> compared to when the job scheduler restarts a job for whatever reason?
>
> This way I'll stop and restart the job a few times, or maybe I can trick my
> job into failing and have the scheduler restart it. Ok, let me think about this...



Re: How to debug Metaspace exception?

2022-03-30 Thread John Smith
Also, if I manually cancel and restart the same job over and over, is that the
same as Flink restarting a job due to a failure?

I.e., when I click "Cancel Job" in the UI, is the job completely unloaded,
compared to when the job scheduler restarts a job for whatever reason?

This way I'll stop and restart the job a few times, or maybe I can trick my
job into failing and have the scheduler restart it. Ok, let me think about this...

On Wed, Mar 30, 2022 at 10:24 AM 胡伟华  wrote:

>> So if I run the same jobs in my dev env will I still be able to see the
>> similar dump?
>
> I think the issue should be reproducible if you run the same job in dev;
> you could give it a try.
>
>> If not I would have to wait at a low volume time to do it on production.
>> Also if I recall the dump is as big as the JVM memory right so if I have
>> 10GB configured for the JVM the dump will be a 10GB file?
>
> Yes. jmap will pause the JVM, and the length of the pause depends on the
> size of the dump. You can use "jmap -dump:live" to dump only the reachable
> objects, which should keep the pause brief.


Re: How to debug Metaspace exception?

2022-03-30 Thread 胡伟华
> So if I run the same jobs in my dev env will I still be able to see the
> similar dump?
I think the issue should be reproducible if you run the same job in dev; you
could give it a try.

> If not I would have to wait at a low volume time to do it on production.
> Also if I recall the dump is as big as the JVM memory right so if I have 10GB
> configured for the JVM the dump will be a 10GB file?

Yes. jmap will pause the JVM, and the length of the pause depends on the size
of the dump. You can use "jmap -dump:live" to dump only the reachable objects,
which should keep the pause brief.
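
For reference, the same kind of live-objects-only dump can also be requested from inside a HotSpot JVM through the HotSpotDiagnosticMXBean. The following is only a minimal sketch under stated assumptions: the class name LiveHeapDump and the output file name are hypothetical, it dumps the JVM it runs in (not a remote TaskManager), and the JVM is still paused while the dump is written.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Flink: writes a heap dump of the current JVM.
class LiveHeapDump {

    // liveOnly = true dumps only reachable objects, like "jmap -dump:live".
    static Path dump(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap refuses to overwrite an existing file, so remove stale dumps first.
        Files.deleteIfExists(target);
        // Recent JDKs require the file name to end in ".hprof".
        diagnostics.dumpHeap(target.toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Assumed output location; adjust to a disk with enough free space.
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "sketch-live.hprof");
        Path written = dump(target, true);
        System.out.println("Wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

The resulting .hprof file can be opened in MAT in the same way as a dump produced by jmap.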






Re: How to debug Metaspace exception?

2022-03-30 Thread John Smith
I have 3 task managers (see config below). There is a total of 10 jobs with
25 slots being used.
The jobs are 100% ETL, i.e., they load JSON, transform it and push it to
JDBC; only 1 job of the 10 pushes to an Apache Ignite cluster.

Regarding jmap: I know that it will pause the task manager. So if I run the same
jobs in my dev env, will I still be able to see a similar dump? I assume
so. If not, I would have to wait for a low-volume time to do it in
production. Also, if I recall, the dump is as big as the JVM memory, right? So
if I have 10GB configured for the JVM, the dump will be a 10GB file?


# Operating system has 16GB total.
env.ssh.opts: -l flink -oStrictHostKeyChecking=no

cluster.evenly-spread-out-slots: true

taskmanager.memory.flink.size: 10240m
taskmanager.memory.jvm-metaspace.size: 2048m
taskmanager.numberOfTaskSlots: 16
parallelism.default: 1

high-availability: zookeeper
high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
high-availability.zookeeper.quorum: ...
high-availability.zookeeper.path.root: /flink_1_14
high-availability.cluster-id: /flink_1_14_cluster_0001

web.upload.dir: /mnt/flink/uploads/flink_1_14

state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
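
As a rough sanity check of the config above against the 16GB host, the total TaskManager process size can be estimated as Flink memory plus metaspace plus JVM overhead. This is only a sketch under stated assumptions: the class name TaskManagerMemoryBudget is hypothetical, the JVM overhead settings (fraction 0.1, min 192m, max 1024m) are assumed defaults that do not appear in the posted config, and the formula is a simplification of how Flink derives the overhead.

```java
// Hypothetical helper, not part of Flink: back-of-the-envelope memory budget in MB.
class TaskManagerMemoryBudget {

    // Overhead is a fraction of the total process size, clamped to [minMb, maxMb].
    // Solving overhead = fraction * (base + overhead) gives base * f / (1 - f).
    static long jvmOverheadMb(long flinkPlusMetaspaceMb, double fraction, long minMb, long maxMb) {
        long derived = Math.round(flinkPlusMetaspaceMb * fraction / (1.0 - fraction));
        return Math.max(minMb, Math.min(maxMb, derived));
    }

    static long totalProcessMb(long flinkMb, long metaspaceMb,
                               double fraction, long minMb, long maxMb) {
        long base = flinkMb + metaspaceMb;
        return base + jvmOverheadMb(base, fraction, minMb, maxMb);
    }

    public static void main(String[] args) {
        // 10240m and 2048m come from the config above; the rest are assumed defaults.
        long total = totalProcessMb(10240, 2048, 0.1, 192, 1024);
        System.out.println(total + " MB"); // prints "13312 MB"
    }
}
```

Under these assumptions each TaskManager process would need about 13GB of the 16GB available, so raising the metaspace size much further leaves little room for the operating system.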

On Wed, Mar 30, 2022 at 2:16 AM 胡伟华  wrote:

> Hi, John
>
> Could you tell us your application scenario? Is it a Flink session cluster
> with a lot of jobs?
>
> Maybe you can try to dump the memory with jmap and use a tool such as MAT
> to analyze whether there are abnormal classes and classloaders.


Re: How to debug Metaspace exception?

2022-03-30 Thread 胡伟华
Hi, John

Could you tell us your application scenario? Is it a Flink session cluster with
a lot of jobs?

Maybe you can try to dump the memory with jmap and use a tool such as MAT to
analyze whether there are abnormal classes and classloaders.


> On Mar 30, 2022, at 6:09 AM, John Smith wrote:
>
> Hi, I'm running Flink 1.14.4.
>
> My task manager still fails with java.lang.OutOfMemoryError: Metaspace. The
> metaspace out-of-memory error has occurred. This can mean two things: either
> the job requires a larger size of JVM metaspace to load classes or there is a
> class loading leak.
>
> I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size: 2048m
>
> But the task nodes still fail.
>
> In the UI metrics, metaspace usage starts low, but now I see 85%
> usage. It looks like a class loading leak at this point. How can we debug
> this issue?



How to debug Metaspace exception?

2022-03-29 Thread John Smith
Hi, I'm running Flink 1.14.4.

My task manager still fails with java.lang.OutOfMemoryError: Metaspace.
The metaspace out-of-memory error has occurred. This can mean two things:
either the job requires a larger size of JVM metaspace to load classes or
there is a class loading leak.

I have 2GB of metaspace configured: taskmanager.memory.jvm-metaspace.size:
2048m

But the task nodes still fail.

In the UI metrics, metaspace usage starts low, but now I see 85%
usage. It looks like a class loading leak at this point. How can we debug
this issue?
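
One way to tell "needs more metaspace" apart from "class loading leak" is to watch metaspace usage next to the loaded and unloaded class counts over several job restarts. The following is only a minimal sketch under stated assumptions: the class name MetaspaceProbe is hypothetical, it reports on the JVM it runs in (so for a TaskManager it would have to be called from job code or read remotely over JMX), and it relies on HotSpot naming the memory pool "Metaspace".

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink: reports metaspace usage and class
// counts for the JVM it runs in.
class MetaspaceProbe {

    // Returns the usage of the pool HotSpot names "Metaspace", or null if absent.
    static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        return null;
    }

    static String report() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        MemoryUsage usage = metaspaceUsage();
        // getMax() is -1 when no metaspace limit is set for the JVM.
        String metaspace = (usage == null)
                ? "metaspace=unavailable"
                : "metaspaceUsedBytes=" + usage.getUsed()
                        + " metaspaceMaxBytes=" + usage.getMax();
        return metaspace
                + " loaded=" + classes.getLoadedClassCount()
                + " totalLoaded=" + classes.getTotalLoadedClassCount()
                + " unloaded=" + classes.getUnloadedClassCount();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If the loaded count keeps climbing after every restart while the unloaded count stays flat, that points towards a leak rather than an undersized metaspace.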