Hi Averell,

What Flink version are you using? Can you attach the full logs from the JM
and TMs? Since Flink 1.5, the -n parameter (number of TaskManagers) should
be omitted unless you are in legacy mode [1].
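A sketch of what that looks like (flag names as in the EMR flink-yarn-session wrapper used in your command below, just with -n dropped; -d for detached mode is an assumption on my part, adjust to taste):

```shell
# Since Flink 1.5 (non-legacy mode), TM containers are requested on
# demand from YARN, so only slots-per-TM and memory need to be set:
flink-yarn-session -s 8 -jm 1024m -tm 20g -d
```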
> As per that screenshot, it looks like there are 2 task managers still
> running (one on each host .88 and .81), which means the one on .88 has not
> been cleaned properly. If it is, then how to clean it?

The TMs should terminate if they cannot register at the JM [2].

> I wonder whether when the server with JobManager crashes, the whole job is
> restarted, or a new JobManager will try to connect to the running TMs to
> resume the job?

The whole job is restarted, but any existing TM containers are reused.

Best,
Gary

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#legacy
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#taskmanager-registration-timeout

On Wed, Jan 23, 2019 at 7:19 AM Averell <lvhu...@gmail.com> wrote:

> Hello everyone,
>
> I am testing High Availability of Flink on YARN on an AWS EMR cluster.
> My configuration is an EMR cluster with one master node and 3 core nodes
> (each with 16 vCores). Zookeeper is running on all nodes.
> The YARN session was created with: flink-yarn-session -n 2 -s 8 -jm 1024m
> -tm 20g
> A job with parallelism of 16 was submitted.
>
> I tried to execute the test by terminating the core node (using Linux
> "init 0") that hosted the JobManager. The first few restarts worked well -
> a new JobManager was elected, and the job was resumed properly.
> However, after some restarts, the new JobManager could not retrieve the
> resources it needed any more (only one TM, on the node with IP .81, was
> shown in the Task Managers GUI).
>
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Flink.png>
>
> I kept getting the error message
> "org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 108, slots allocated: 60".
>
> Here below is what is shown in YARN Resource Manager.
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Yarn.png>
>
> As per that screenshot, it looks like there are 2 task managers still
> running (one on each host .88 and .81), which means the one on .88 has
> not been cleaned up properly. If so, how can I clean it up?
>
> I wonder whether, when the server with the JobManager crashes, the whole
> job is restarted, or whether a new JobManager will try to connect to the
> running TMs to resume the job?
>
> Thanks and regards,
> Averell
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
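For reference, the two timeouts involved here correspond to these flink-conf.yaml keys (a sketch; key names and defaults as listed in the Flink 1.7 configuration documentation linked above):

```yaml
# How long an unregistered TM keeps retrying before shutting itself down [2]
taskmanager.registration-timeout: 5 min

# Timeout for the JM's slot requests; the "300000 ms" in the
# NoResourceAvailableException is this key's default value
slot.request.timeout: 300000
```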