Thanks everyone for jumping in. BTW, we are using flink-1.4.1; the deployment is standalone mode.
here is the JIRA: https://issues.apache.org/jira/browse/FLINK-9693

On Fri, Jun 29, 2018 at 12:09 PM, Stephan Ewen <se...@apache.org> wrote:

> Just saw Stefan's response, it is basically the same.
>
> We either null out the field on deploy or archival. On deploy would be
> even more memory friendly.
>
> @Steven - can you open a JIRA ticket for this?
>
> On Fri, Jun 29, 2018 at 9:08 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> The problem seems to be that the Executions that are kept for history
>> (mainly metrics / web UI) still hold a reference to their TaskStateSnapshot.
>>
>> Upon archival, that field needs to be cleared for GC.
>>
>> This is quite clearly a bug...
>>
>> On Fri, Jun 29, 2018 at 11:29 AM, Stefan Richter <
>> s.rich...@data-artisans.com> wrote:
>>
>>> Hi Steven,
>>>
>>> From your analysis, I would conclude the following problem:
>>> ExecutionVertexes hold Executions, which are bootstrapped with the state
>>> (in the form of a map of state handles) when the job is initialized from a
>>> checkpoint/savepoint. An Execution holds a reference to this state even when
>>> the task is already running. I would assume it is safe to set the reference
>>> to the TaskStateSnapshot to null at the end of the deploy() method so it can
>>> be GC'ed. From the provided stats, I cannot say whether the JM is also
>>> holding references to too many ExecutionVertexes, but that would be a
>>> different story.
>>>
>>> Best,
>>> Stefan
>>>
>>> On 29.06.2018 at 01:29, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>> First, some context about the job:
>>> * embarrassingly parallel: all operators are chained together
>>> * parallelism is over 1,000
>>> * stateless except for the Kafka source operators; checkpoint size is 8.4 MB
>>> * "state.backend.fs.memory-threshold" is set so that only the jobmanager
>>> writes checkpoints to S3
>>> * internal checkpoints, with 10 checkpoints retained in history
>>>
>>> We don't expect the jobmanager to use much memory at all, but this high
>>> memory footprint (or leak) seems to happen occasionally, maybe under
>>> certain conditions. Any hypothesis?
>>>
>>> Thanks,
>>> Steven
>>>
>>> 41,567 ExecutionVertex objects retained 9+ GB of memory
>>> <image.png>
>>>
>>> Expanded one ExecutionVertex; it seems to be storing the Kafka offsets
>>> for the source operator
>>> <image.png>
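To make the retention pattern concrete, here is a minimal, self-contained sketch (not actual Flink source; RestoreState and ExecutionRecord are hypothetical stand-ins for TaskStateSnapshot and Execution) of the problem Stefan describes and the proposed fix of dropping the reference at the end of deploy():

    import java.util.HashMap;
    import java.util.Map;

    // Stand-in for TaskStateSnapshot: e.g. Kafka partition offsets restored per operator.
    class RestoreState {
        final Map<Integer, Long> partitionOffsets = new HashMap<>();
    }

    // Stand-in for an Execution that is retained for history (web UI / metrics)
    // after the task has been deployed.
    class ExecutionRecord {
        private RestoreState restoreState; // set when the job is restored from a checkpoint

        ExecutionRecord(RestoreState state) {
            this.restoreState = state;
        }

        void deploy() {
            // ... hand restoreState over to the task manager here ...

            // Proposed fix: drop the reference once deployment is done, so that
            // records kept for history no longer pin the state handles and the
            // snapshot becomes eligible for GC.
            this.restoreState = null;
        }
    }

Clearing the field on deploy rather than on archival matches Stephan's "even more memory friendly" point: the reference is released as soon as it is no longer needed instead of being held until the execution is archived.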