Hi Hayden,

in YARN mode, Flink tolerates as many TaskManager (TM) failures as you have configured via `yarn.maximum-failed-containers`. By default, this is set to the initial number of requested TMs. So in your case, with two TMs, Flink would restart a failed TM twice and then fail the cluster once a TM fails for the third time.
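To make the failure budget concrete, here is a toy Python model of the behaviour described above. It is only a sketch and not Flink's actual implementation: the function name `simulate_tm_failures` and the event strings are made up for illustration, and the only thing taken from Flink is the meaning of `yarn.maximum-failed-containers`.

```python
from typing import List


def simulate_tm_failures(max_failed_containers: int, failures: int) -> List[str]:
    """Toy model of the TM failure budget (hypothetical helper, not Flink code).

    max_failed_containers stands in for `yarn.maximum-failed-containers`;
    failures is the number of TM failures that occur over the cluster's lifetime.
    """
    events = []
    failed_so_far = 0
    for _ in range(failures):
        failed_so_far += 1
        if failed_so_far > max_failed_containers:
            # Budget exhausted: the whole cluster is failed.
            events.append("cluster fails")
            break
        # Still within budget: a replacement TM container is started.
        events.append("TM restarted")
    return events


# Two requested TMs, so the default budget is 2; the third failure is fatal.
print(simulate_tm_failures(max_failed_containers=2, failures=3))
# → ['TM restarted', 'TM restarted', 'cluster fails']
```

If two restarts are not enough for your setup, raising `yarn.maximum-failed-containers` in the Flink configuration increases that budget accordingly.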

Cheers,
Till

On Mon, Dec 4, 2017 at 11:31 AM, Marchant, Hayden <hayden.march...@citi.com> wrote:

> Hi,
>
> We are currently starting to test Flink running on YARN. Until now, we've been
> testing on a Standalone Cluster. One thing lacking in standalone mode is that
> we have to manually restart a Task Manager if it dies. I looked at
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html#yarn-cluster-high-availability
> and see that YARN deals with HA for the Job Manager. How does it deal with a
> Task Manager if it dies? I would like the Task Manager to be dealt with
> similarly to the Job Manager on failure. For example, let's say I have a
> cluster with two Task Managers, and one Task Manager dies. Will YARN restart
> the dead Task Manager, or would that need to be a manual restart?
>
> What would actually happen in the above case?
>
> Thanks,
> Hayden