Hi Averell,

What Flink version are you using? Can you attach the full logs from the JM
and TMs? Since Flink 1.5, the -n parameter (number of TaskManagers) should
be omitted unless you are in legacy mode [1].
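A sketch of what that looks like (flag names as in the EMR flink-yarn-session wrapper used in your command below, just with -n dropped; -d for detached mode is an assumption on my part, adjust to taste):

```shell
# Since Flink 1.5 (non-legacy mode), TM containers are requested on
# demand from YARN, so only slots-per-TM and memory need to be set:
flink-yarn-session -s 8 -jm 1024m -tm 20g -d
```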
> As per that screenshot, it looks like there are 2 task managers still
> running (one on each host .88 and .81), which means the one on .88 has not
> been cleaned properly. If it is, then how to clean it?

The TMs should terminate if they cannot register at the JM [2].

> I wonder whether when the server with JobManager crashes, the whole job is
> restarted, or a new JobManager will try to connect to the running TMs to
> resume the job?

The whole job is restarted, but any existing TM containers are reused.

Best,
Gary

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#legacy
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#taskmanager-registration-timeout

On Wed, Jan 23, 2019 at 7:19 AM Averell <lvhu...@gmail.com> wrote:

> Hello everyone,
>
> I am testing High Availability of Flink on YARN on an AWS EMR cluster.
> My configuration is an EMR cluster with one master node and 3 core nodes
> (each with 16 vCores). Zookeeper is running on all nodes.
> The YARN session was created with: flink-yarn-session -n 2 -s 8 -jm 1024m
> -tm 20g
> A job with parallelism of 16 was submitted.
>
> I tried to execute the test by terminating the core node (using Linux
> "init 0") that hosted the JobManager. The first few restarts worked well -
> a new JobManager was elected, and the job was resumed properly.
> However, after some restarts, the new JobManager could not retrieve the
> resources it needed any more (only one TM, on the node with IP .81, was
> shown in the Task Managers GUI).
>
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Flink.png>
>
> I kept getting the error message
> "org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 108, slots allocated: 60".
>
> Here below is what is shown in YARN Resource Manager.
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Yarn.png>
>
> As per that screenshot, it looks like there are 2 task managers still
> running (one on each host .88 and .81), which means the one on .88 has
> not been cleaned up properly. If so, how can I clean it up?
>
> I wonder whether, when the server with the JobManager crashes, the whole
> job is restarted, or whether a new JobManager will try to connect to the
> running TMs to resume the job?
>
> Thanks and regards,
> Averell
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
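For reference, the two timeouts involved here correspond to these flink-conf.yaml keys (a sketch; key names and defaults as listed in the Flink 1.7 configuration documentation linked above):

```yaml
# How long an unregistered TM keeps retrying before shutting itself down [2]
taskmanager.registration-timeout: 5 min

# Timeout for the JM's slot requests; the "300000 ms" in the
# NoResourceAvailableException is this key's default value
slot.request.timeout: 300000
```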