Also, just to be sure: this is a Task Manager setting, right?
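
A minimal sketch of that entry, assuming it goes in flink-conf.yaml and reusing the org.apache.ignite. pattern tried earlier in this thread. The com.microsoft.sqlserver. prefix for the Microsoft driver is an assumption (not confirmed in this thread), and per the replies below the entry only matters when the driver jar is also in Flink's lib folder:

```yaml
# Hypothetical flink-conf.yaml fragment.
# Patterns are package prefixes, separated by semicolons.
# org.apache.ignite. is the pattern used earlier in this thread;
# com.microsoft.sqlserver. is an assumed prefix for the Microsoft JDBC driver.
classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver."
```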

On Thu, Apr 28, 2022 at 11:13 AM John Smith <java.dev....@gmail.com> wrote:

> I assume you will take action on your side to track and fix the doc? :)
>
> On Thu, Apr 28, 2022 at 11:12 AM John Smith <java.dev....@gmail.com>
> wrote:
>
>> Ok so to summarize...
>>
>> - Build my job jar with the JDBC driver as a compile-only
>> dependency and copy the JDBC driver to the Flink lib folder.
>>
>> Or
>>
>> - Build my job jar and include the JDBC driver in the shadow jar, plus copy
>> the JDBC driver to the Flink lib folder, plus make an entry in the config for
>> classloader.parent-first-patterns.additional
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>
>>
>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
>>> I think what I meant was "either add it to /lib, or [if it is already in
>>> /lib but also bundled in the jar] add it to the parent-first patterns."
>>>
>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>
>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>
>>> On 28/04/2022 15:49, John Smith wrote:
>>>
>>> You sure?
>>>
>>>    -
>>>
>>>    *JDBC*: JDBC drivers leak references outside the user code
>>>    classloader. To ensure that these classes are only loaded once you should
>>>    either add the driver jars to Flink’s lib/ folder, or add the driver
>>>    classes to the list of parent-first loaded class via
>>>    classloader.parent-first-patterns-additional
>>>    
>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>    .
>>>
>>>    It says either or
>>>
>>>
>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ches...@apache.org>
>>> wrote:
>>>
>>>> You're misinterpreting the docs.
>>>>
>>>> The parent/child-first classloading controls where Flink looks for a
>>>> class *first*, specifically whether we first load from /lib or the
>>>> user-jar.
>>>> It does not allow you to load something from the user-jar in the parent
>>>> classloader. That's just not how it works.
>>>>
>>>> It must be in /lib.
>>>>
>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>
>>>> Hi Chesnay as per the docs...
>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>
>>>> You can either put the jars in the task manager lib folder or use
>>>> classloader.parent-first-patterns.additional
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>
>>>> I prefer the latter: that way the dependency stays with the user-jar
>>>> and not on the task manager.
>>>>
>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok so I should put the Apache Ignite and my Microsoft drivers in the
>>>>> lib folders of my task managers?
>>>>>
>>>>> And then in my job jar only include them as compile time dependencies?
>>>>>
>>>>>
>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ches...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>
>>>>>> You have correctly identified your alternatives.
>>>>>>
>>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>>> parent-first pattern shouldn't affect anything.
>>>>>> That is only relevant if something is both in /lib and in the
>>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>
>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>
>>>>>> Or it's too early to tell.
>>>>>>
>>>>>> Though now, the task managers are shutting down due to some
>>>>>> other failures.
>>>>>>
>>>>>> So maybe because tasks were failing and reloading often, the task
>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>> cleanly shutting down.
>>>>>>
>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <java.dev....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Or can I put an entry in the config to treat org.apache.ignite. classes
>>>>>>> as parent-first?
>>>>>>>
>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>
>>>>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>>>>>> "Exclude all phantom/weak/soft references"
>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>> Driver
>>>>>>>>
>>>>>>>> So I'm guessing this applies to anything JDBC-based. Should I copy the
>>>>>>>> drivers into the task manager lib folder and have my jobs declare them
>>>>>>>> as compile-only dependencies?
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>>>>> yaros...@goldsky.io> wrote:
>>>>>>>>
>>>>>>>>> Also
>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>> might be helpful (has a section on profiling, as well as 
>>>>>>>>> classloading).
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <
>>>>>>>>> ches...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>>>>>> steps I took to debug another leak):
>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>
>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, John
>>>>>>>>>>
>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>>>>>> file. Check whether there are too many loaded classes.
>>>>>>>>>>
>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>
>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file
>>>>>>>>>> before.
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <
>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <
>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and restart
>>>>>>>>>>>> it from the UI multiple times, I won't see the issue because the
>>>>>>>>>>>> classes are unloaded correctly?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <
>>>>>>>>>>>> huweihua....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running.
>>>>>>>>>>>>> From the TaskManager's point of view, though, it doesn't make
>>>>>>>>>>>>> much difference.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, if I manually cancel and restart the same job over and
>>>>>>>>>>>>> over, is it the same as if Flink was restarting a job due to
>>>>>>>>>>>>> failure?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I.e., when I click "Cancel Job" on the UI, is the job completely
>>>>>>>>>>>>> unloaded, vs. when the job scheduler restarts a job for
>>>>>>>>>>>>> whatever reason?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Like this I'll stop and restart the job a few times, or maybe I
>>>>>>>>>>>>> can trick my job to fail and have the scheduler restart it. Ok,
>>>>>>>>>>>>> let me think about this...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able
>>>>>>>>>>>>>> to see the similar dump?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think running the same job in dev should be reproducible,
>>>>>>>>>>>>>> maybe you can have a try.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If not I would have to wait for a low-volume time to do it on
>>>>>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM
>>>>>>>>>>>>>> memory, right? So if I have 10GB configured for the JVM, the
>>>>>>>>>>>>>> dump will be a 10GB file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only
>>>>>>>>>>>>>> the reachable objects, which should keep the pause brief.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have 3 task managers (see config below). There is a total of
>>>>>>>>>>>>>> 10 jobs with 25 slots being used.
>>>>>>>>>>>>>> The jobs are 100% ETL, i.e., they load JSON, transform it and
>>>>>>>>>>>>>> push it to JDBC; only 1 job of the 10 is pushing to an Apache
>>>>>>>>>>>>>> Ignite cluster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I
>>>>>>>>>>>>>> run the same jobs in my dev env, will I still be able to see a
>>>>>>>>>>>>>> similar dump? I assume so. If not I would have to wait for a
>>>>>>>>>>>>>> low-volume time to do it on production. Also, if I recall, the
>>>>>>>>>>>>>> dump is as big as the JVM memory, right? So if I have 10GB
>>>>>>>>>>>>>> configured for the JVM, the dump will be a 10GB file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua....@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and 
>>>>>>>>>>>>>>> classloaders
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Hi running 1.14.4
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > My task manager still fails with
>>>>>>>>>>>>>>> java.lang.OutOfMemoryError: Metaspace. The metaspace 
>>>>>>>>>>>>>>> out-of-memory error
>>>>>>>>>>>>>>> has occurred. This can mean two things: either the job requires 
>>>>>>>>>>>>>>> a larger
>>>>>>>>>>>>>>> size of JVM metaspace to load classes or there is a class 
>>>>>>>>>>>>>>> loading leak.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low.
>>>>>>>>>>>>>>> Now I see 85% usage. It seems to be a class loading leak at 
>>>>>>>>>>>>>>> this point, how
>>>>>>>>>>>>>>> can we debug this issue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>
>>>
>>>
