Thanks everyone for jumping in. BTW, we are using flink-1.4.1; the deployment is standalone mode.
here is the JIRA: https://issues.apache.org/jira/browse/FLINK-9693

On Fri, Jun 29, 2018 at 12:09 PM, Stephan Ewen <se...@apache.org> wrote:

> Just saw Stefan's response, it is basically the same.
>
> We either null out the field on deploy or archival. On deploy would be
> even more memory friendly.
>
> @Steven - can you open a JIRA ticket for this?
>
> On Fri, Jun 29, 2018 at 9:08 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> The problem seems to be that the Executions that are kept for history
>> (mainly metrics / web UI) still hold a reference to their TaskStateSnapshot.
>>
>> Upon archival, that field needs to be cleared for GC.
>>
>> This is quite clearly a bug...
>>
>> On Fri, Jun 29, 2018 at 11:29 AM, Stefan Richter <
>> s.rich...@data-artisans.com> wrote:
>>
>>> Hi Steven,
>>>
>>> From your analysis, I would conclude the following problem:
>>> ExecutionVertexes hold Executions, which are bootstrapped with the state
>>> (in the form of a map of state handles) when the job is initialized from a
>>> checkpoint/savepoint. An Execution holds a reference to this state even when
>>> the task is already running. I would assume it is safe to set the reference
>>> to the TaskStateSnapshot to null at the end of the deploy() method so it can
>>> be GC'ed. From the provided stats, I cannot say whether the JM is also
>>> holding references to too many ExecutionVertexes, but that would be a
>>> different story.
>>>
>>> Best,
>>> Stefan
>>>
>>> On 29.06.2018 at 01:29, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>> First, some context about the job:
>>> * embarrassingly parallel: all operators are chained together
>>> * parallelism is over 1,000
>>> * stateless except for the Kafka source operators; checkpoint size is 8.4 MB
>>> * "state.backend.fs.memory-threshold" is set so that only the jobmanager
>>> writes checkpoints to S3
>>> * internal checkpoints, with 10 checkpoints retained in history
>>>
>>> We don't expect the jobmanager to use much memory at all, but this high
>>> memory footprint (or leak) seems to happen occasionally, maybe under
>>> certain conditions. Any hypothesis?
>>>
>>> Thanks,
>>> Steven
>>>
>>> 41,567 ExecutionVertex objects retained 9+ GB of memory
>>> <image.png>
>>>
>>> Expanded one ExecutionVertex; it seems to be storing the Kafka offsets
>>> for the source operator
>>> <image.png>
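To make the retention pattern concrete, here is a minimal, self-contained sketch (not actual Flink source; RestoreState and ExecutionRecord are hypothetical stand-ins for TaskStateSnapshot and Execution) of the problem Stefan describes and the proposed fix of dropping the reference at the end of deploy():

    import java.util.HashMap;
    import java.util.Map;

    // Stand-in for TaskStateSnapshot: e.g. Kafka partition offsets restored per operator.
    class RestoreState {
        final Map<Integer, Long> partitionOffsets = new HashMap<>();
    }

    // Stand-in for an Execution that is retained for history (web UI / metrics)
    // after the task has been deployed.
    class ExecutionRecord {
        private RestoreState restoreState; // set when the job is restored from a checkpoint

        ExecutionRecord(RestoreState state) {
            this.restoreState = state;
        }

        void deploy() {
            // ... hand restoreState over to the task manager here ...

            // Proposed fix: drop the reference once deployment is done, so that
            // records kept for history no longer pin the state handles and the
            // snapshot becomes eligible for GC.
            this.restoreState = null;
        }
    }

Clearing the field on deploy rather than on archival matches Stephan's "even more memory friendly" point: the reference is released as soon as it is no longer needed instead of being held until the execution is archived.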