Re: Flink 1.2 Jobmanager OOME - CheckpointCoordinators

Ufuk Celebi Tue, 28 Feb 2017 09:17:30 -0800

@Konstantin: Could you share a relevant part of the heap dump just to
get a second look?

The timer tasks are responsible for aborting the checkpoint if a
checkpoint timeout occurs. You can decrease the timeout via the
CheckpointConfig
(env.getCheckpointConfig().setCheckpointTimeout(long)); the current
default is 10 minutes.
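
For reference, a minimal sketch of where that call goes in a streaming
job. It needs the Flink streaming dependency on the classpath; the class
name and the 30 second timeout are only illustrative assumptions, and the
100 ms interval is taken from the report below.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 100 ms, as in the reported setup.
        env.enableCheckpointing(100);

        // Abort checkpoints that are still pending after 30 s instead of
        // the 10 minute default (30 s is an arbitrary illustrative value).
        env.getCheckpointConfig().setCheckpointTimeout(30_000L);

        // Define sources, operators and sinks here, then call env.execute(...).
    }
}
```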

On a first skim of the checkpoint coordinator code I didn't see
anything that cancels these tasks when the checkpoint is fully ack'd.
@Stephan: I think we should do that. What do you think?
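
To make the idea concrete, here is a minimal sketch of cancelling the
timeout task on acknowledgement. It is not the coordinator code: it
swaps in a ScheduledThreadPoolExecutor for the Timer/TimerTask discussed
here, only because its queue size can be inspected directly, and
PendingCheckpoint and queuedAfterAck are hypothetical names.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CancelOnAckSketch {

    /** Hypothetical stand-in for a pending checkpoint; not a Flink class. */
    static final class PendingCheckpoint {
        final long id;
        PendingCheckpoint(long id) { this.id = id; }
    }

    /**
     * Schedules one timeout task per checkpoint, treats every checkpoint as
     * acknowledged right away, and returns how many timeout tasks are still
     * queued afterwards.
     */
    static int queuedAfterAck(int checkpoints, boolean cancelOnAck) {
        ScheduledThreadPoolExecutor timer = new ScheduledThreadPoolExecutor(1);
        // Remove a cancelled task from the queue immediately, not at expiry.
        timer.setRemoveOnCancelPolicy(true);
        try {
            for (long id = 0; id < checkpoints; id++) {
                final PendingCheckpoint checkpoint = new PendingCheckpoint(id);
                // The task captures the checkpoint, so the queue keeps it
                // reachable for as long as the task stays queued.
                ScheduledFuture<?> timeout = timer.schedule(
                        () -> System.out.println("aborting checkpoint " + checkpoint.id),
                        10, TimeUnit.MINUTES);
                if (cancelOnAck) {
                    timeout.cancel(false); // checkpoint fully acknowledged
                }
            }
            return timer.getQueue().size();
        } finally {
            timer.shutdownNow(); // discard whatever is still queued
        }
    }

    public static void main(String[] args) {
        System.out.println("without cancel: " + queuedAfterAck(1000, false));
        System.out.println("with cancel:    " + queuedAfterAck(1000, true));
    }
}
```

Without the cancel call, every timeout task (and the checkpoint it
captures) stays queued until the timeout expires; with it, the queue is
empty as soon as the checkpoints are acknowledged.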

On Tue, Feb 28, 2017 at 4:06 PM, Konstantin Knauf
<konstantin.kn...@tngtech.com> wrote:
> Hi everyone,
>
> I am currently running a small Flink job locally, which checkpoints
> every 100ms.
>
> After a few minutes the JM crashes with an OOME. In the heap dump I can
> see that a TimerTask holds references to all completed
> CheckpointCoordinators. I assume this task is supposed to clean these
> checkpoints up eventually.
>
> First, is this the expected behaviour? Second, is there a configuration
> option to trigger this cleanup timer earlier?
>
> Cheers,
>
> Konstantin
>
> --
> Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>