Hi Konstantin,

If you come from traditional on-premises installations, it may seem counter-intuitive to start a Flink cluster for each job. However, in today's cluster world it is not a problem to request containers on demand and spawn a new Flink cluster for each job. Per-job clusters are convenient because they can be tailored to the job: you only request as many resources as that job needs. The typical on-premises Flink cluster, which you get when you start a YARN session, has a static resource consumption even when no job is running. The per-job cluster, on the other hand, releases all of its resources when the job has finished.
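For illustration, the two modes look roughly like this on the command line (a sketch only; the resource numbers, class name, and jar are placeholders, adjust them to your setup):

    # Session mode: a long-running cluster with statically allocated
    # resources; jobs are submitted into it.
    yarn-session.sh -n 4 -jm 1024 -tm 4096 -d
    flink run -c com.example.MyJob my-job.jar

    # Per-job mode: "-m yarn-cluster" brings up a dedicated cluster
    # sized for this one job and tears it down when the job finishes.
    flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 -c com.example.MyJob my-job.jar

In the first case the four TaskManagers stay allocated even between jobs; in the second, YARN releases the containers as soon as the job terminates.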
Best,
Max

On Thu, Jul 28, 2016 at 12:28 PM, Konstantin Knauf
<konstantin.kn...@tngtech.com> wrote:
> Hi Stephan,
>
> thank you for this clarification. I have a slightly related follow-up
> question. I keep reading that the preferred way to run Flink on YARN is
> "Flink-job-at-a-time-on-yarn". Can you explain this a little further?
> Of course, with separate YARN sessions the jobs are more decoupled, but
> on the other hand it seems counter-intuitive to start a new Flink
> cluster for each job.
>
> Best Regards,
>
> Konstantin
>
> On 12.07.2016 15:48, Stephan Ewen wrote:
>> I think there is a confusion between how Flink thinks about HA and job
>> life cycle, and how many users think about it.
>>
>> Flink considers the killing of the YARN session a failure of the job,
>> so as soon as new YARN resources become available, it tries to recover
>> the job. Most users, however, think that killing a YARN session is
>> equivalent to cancelling the job.
>>
>> I am unsure whether we should start to interpret the killing of a YARN
>> session as a cancellation. Do YARN sessions never get killed
>> accidentally, or as the result of a YARN-related failure?
>>
>> With Flink-job-at-a-time-on-yarn, cancelling the Flink job also shuts
>> down the YARN session and hence shuts down everything properly.
>>
>> Hope that train of thought helps.
>>
>>
>> On Tue, Jul 12, 2016 at 3:15 PM, Ufuk Celebi <u...@apache.org> wrote:
>>
>> Are you running in HA mode? If yes, that's the expected behaviour at
>> the moment, because the ZooKeeper data is only cleaned up on a
>> terminal state (FINISHED, FAILED, CANCELLED). You have to specify
>> separate ZooKeeper root paths via "recovery.zookeeper.path.root".
>> There is an issue, which should be fixed for 1.2, to make this
>> configurable in an easier way.
>>
>> On Tue, Jul 12, 2016 at 1:28 PM, Konstantin Gregor
>> <konstantin.gre...@tngtech.com> wrote:
>> > Hello everyone,
>> >
>> > I have a question concerning stopping Flink streaming processes
>> > that run in a detached YARN session.
>> >
>> > Here's what we do: We start a YARN session via
>> >     yarn-session.sh -n 8 -d -jm 4096 -tm 10000 -s 10 -qu flink_queue
>> >
>> > Then we start our Flink streaming application via
>> >     flink run -p 65 -c SomeClass some.jar > /dev/null 2>&1 &
>> >
>> > The problem occurs when we stop the application.
>> > If we stop the Flink application with
>> >     flink cancel <JOB_ID>
>> > and then kill the YARN application with
>> >     yarn application -kill <APPLICATION_ID>
>> > everything is fine.
>> > But what we observed was that when we only kill the YARN application
>> > without specifically cancelling the Flink job first, the Flink job
>> > stays lingering on the machine and uses resources until it is killed
>> > manually via its process id.
>> >
>> > One thing we tried was to stop using ephemeral ports for the
>> > application master, namely we set yarn.application-master.port
>> > to a specific port number, but the problem remains: killing the
>> > YARN application does not kill the corresponding Flink job.
>> >
>> > Does anyone have an idea about this? Any help is greatly
>> > appreciated :-)
>> > By the way, our application reads data from a Kafka queue and
>> > writes it into HDFS, maybe this is also important to know.
>> >
>> > Thank you and best regards
>> >
>> > Konstantin
>> > --
>> > Konstantin Gregor * konstantin.gre...@tngtech.com
>> > TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>> > Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>> > Sitz: Unterföhring * Amtsgericht München * HRB 135082
>>
>
> --
> Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082