I think there is some confusion between how Flink thinks about HA and the job life cycle, and how many users think about it.
Flink treats the killing of the YARN session as a failure of the job. So
as soon as new YARN resources become available, it tries to recover the
job. Most users, however, consider killing a YARN session to be
equivalent to cancelling the job.

I am unsure whether we should start interpreting the killing of a YARN
session as a cancellation. Do YARN sessions never get killed
accidentally, or as the result of a YARN-related failure?

With the job-at-a-time-on-YARN mode (a dedicated YARN cluster per job),
cancelling the Flink job also shuts down the YARN session and hence
shuts everything down properly. A rough sketch of this (and of Ufuk's
per-session ZooKeeper path suggestion below) is in the P.S. at the
bottom of this mail.

Hope that train of thought helps.

On Tue, Jul 12, 2016 at 3:15 PM, Ufuk Celebi <u...@apache.org> wrote:
> Are you running in HA mode? If yes, that's the expected behaviour at
> the moment, because the ZooKeeper data is only cleaned up on a
> terminal state (FINISHED, FAILED, CANCELLED). You have to specify
> separate ZooKeeper root paths via "recovery.zookeeper.path.root".
> There is an issue which should be fixed for 1.2 to make this
> configurable in an easy way.
>
> On Tue, Jul 12, 2016 at 1:28 PM, Konstantin Gregor
> <konstantin.gre...@tngtech.com> wrote:
> > Hello everyone,
> >
> > I have a question about stopping Flink streaming processes that run
> > in a detached YARN session.
> >
> > Here's what we do: we start a YARN session via
> >   yarn-session.sh -n 8 -d -jm 4096 -tm 10000 -s 10 -qu flink_queue
> >
> > Then we start our Flink streaming application via
> >   flink run -p 65 -c SomeClass some.jar > /dev/null 2>&1 &
> >
> > The problem occurs when we stop the application. If we cancel the
> > Flink job with
> >   flink cancel <JOB_ID>
> > and then kill the YARN application with
> >   yarn application -kill <APPLICATION_ID>
> > everything is fine. But what we did not expect is that when we only
> > kill the YARN application, without cancelling the Flink job first,
> > the Flink job stays lingering on the machine and uses resources
> > until it is killed manually via its process id.
> >
> > One thing we tried was to stop using ephemeral ports for the
> > application master, i.e. we set yarn.application-master.port to a
> > fixed port number, but the problem remains: killing the YARN
> > application does not kill the corresponding Flink job.
> >
> > Does anyone have an idea about this? Any help is greatly
> > appreciated :-) By the way, our application reads data from a Kafka
> > queue and writes it into HDFS, maybe this is also important to know.
> >
> > Thank you and best regards
> >
> > Konstantin
> > --
> > Konstantin Gregor * konstantin.gre...@tngtech.com
> > TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> > Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> > Sitz: Unterföhring * Amtsgericht München * HRB 135082
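
P.S. For reference, a minimal sketch of the two approaches mentioned
above. I am assuming the 1.0/1.1 line here; the exact flag and config
key names differ between versions, so please check the docs for your
release rather than copy-pasting.

Per-job YARN cluster instead of a long-running session (the -y* options
mirror the yarn-session.sh options; the values are simply copied from
Konstantin's session and run commands above):

  # Starts a dedicated YARN cluster for this one job; cancelling the
  # job then also tears down the YARN application.
  flink run -m yarn-cluster -yn 8 -yjm 4096 -ytm 10000 -ys 10 \
      -yqu flink_queue -p 65 -c SomeClass some.jar

Separate ZooKeeper root paths when running several HA sessions, set in
each session's flink-conf.yaml (the path value is only an example):

  # Distinct sub-path per session, so the clusters do not pick up each
  # other's recovery data.
  recovery.zookeeper.path.root: /flink/session-a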