I think there is some confusion between how Flink thinks about HA and the job life cycle, and how many users think about it.
Flink treats the killing of the YARN session as a failure of the job. So
as soon as new YARN resources become available, it tries to recover the
job. Most users, however, consider killing a YARN session to be
equivalent to cancelling the job.

I am unsure whether we should start interpreting the killing of a YARN
session as a cancellation. Do YARN sessions never get killed
accidentally, or as the result of a YARN-related failure?

With the job-at-a-time-on-YARN mode (a dedicated YARN cluster per job),
cancelling the Flink job also shuts down the YARN session and hence
shuts everything down properly. A rough sketch of this (and of Ufuk's
per-session ZooKeeper path suggestion below) is in the P.S. at the
bottom of this mail.

Hope that train of thought helps.

On Tue, Jul 12, 2016 at 3:15 PM, Ufuk Celebi <u...@apache.org> wrote:
> Are you running in HA mode? If yes, that's the expected behaviour at
> the moment, because the ZooKeeper data is only cleaned up on a
> terminal state (FINISHED, FAILED, CANCELLED). You have to specify
> separate ZooKeeper root paths via "recovery.zookeeper.path.root".
> There is an issue which should be fixed for 1.2 to make this
> configurable in an easy way.
>
> On Tue, Jul 12, 2016 at 1:28 PM, Konstantin Gregor
> <konstantin.gre...@tngtech.com> wrote:
> > Hello everyone,
> >
> > I have a question about stopping Flink streaming processes that run
> > in a detached YARN session.
> >
> > Here's what we do: we start a YARN session via
> >   yarn-session.sh -n 8 -d -jm 4096 -tm 10000 -s 10 -qu flink_queue
> >
> > Then we start our Flink streaming application via
> >   flink run -p 65 -c SomeClass some.jar > /dev/null 2>&1 &
> >
> > The problem occurs when we stop the application. If we cancel the
> > Flink job with
> >   flink cancel <JOB_ID>
> > and then kill the YARN application with
> >   yarn application -kill <APPLICATION_ID>
> > everything is fine. But what we did not expect is that when we only
> > kill the YARN application, without cancelling the Flink job first,
> > the Flink job stays lingering on the machine and uses resources
> > until it is killed manually via its process id.
> >
> > One thing we tried was to stop using ephemeral ports for the
> > application master, i.e. we set yarn.application-master.port to a
> > fixed port number, but the problem remains: killing the YARN
> > application does not kill the corresponding Flink job.
> >
> > Does anyone have an idea about this? Any help is greatly
> > appreciated :-) By the way, our application reads data from a Kafka
> > queue and writes it into HDFS, maybe this is also important to know.
> >
> > Thank you and best regards
> >
> > Konstantin
> > --
> > Konstantin Gregor * konstantin.gre...@tngtech.com
> > TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> > Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> > Sitz: Unterföhring * Amtsgericht München * HRB 135082
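
P.S. For reference, a minimal sketch of the two approaches mentioned
above. I am assuming the 1.0/1.1 line here; the exact flag and config
key names differ between versions, so please check the docs for your
release rather than copy-pasting.

Per-job YARN cluster instead of a long-running session (the -y* options
mirror the yarn-session.sh options; the values are simply copied from
Konstantin's session and run commands above):

  # Starts a dedicated YARN cluster for this one job; cancelling the
  # job then also tears down the YARN application.
  flink run -m yarn-cluster -yn 8 -yjm 4096 -ytm 10000 -ys 10 \
      -yqu flink_queue -p 65 -c SomeClass some.jar

Separate ZooKeeper root paths when running several HA sessions, set in
each session's flink-conf.yaml (the path value is only an example):

  # Distinct sub-path per session, so the clusters do not pick up each
  # other's recovery data.
  recovery.zookeeper.path.root: /flink/session-a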