Hi Komal,

could you check that every node can reach the other nodes? It looks a little bit as if the TaskManager cannot talk to the JobManager running on 150.82.218.218:6123.

Cheers,
Till
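A minimal sketch of such a reachability check, run from each of the three nodes (the hostnames are the placeholders used later in the thread, and ping/nc being available is an assumption, not something stated in the thread):

    # run this on every node: the other nodes should answer
    for host in master-node1-hostname slave-node2-hostname slave-node3-hostname; do
        ping -c 1 "$host"
    done

    # run this on every node: TCP connect to the JobManager RPC port from the thread
    nc -zv 150.82.218.218 6123

If the port check succeeds from every node, plain network connectivity is not the problem and the focus can shift to the Flink configuration itself.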
On Thu, Sep 12, 2019 at 9:30 AM Komal Mariam <komal.mar...@gmail.com> wrote:

> I managed to fix it, however I ran into another problem that I would
> appreciate help in resolving.
>
> It turns out that the username for all three nodes was different. Having
> the same username for them fixed the issue, i.e.:
> same_username@slave-node2-hostname
> same_username@slave-node3-hostname
> same_username@master-node1-hostname
>
> In fact, because the usernames are the same, I can just save them in the
> conf files as:
> slave-node2-hostname
> slave-node3-hostname
> master-node1-hostname
>
> However, for some reason my worker nodes don't show up as available
> task managers in the web UI.
>
> The taskexecutor log says the following:
> ... (clipped for brevity)
> 2019-09-12 15:56:36,625 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - --------------------------------------------------------------------------------
> 2019-09-12 15:56:36,631 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Registered UNIX signal handlers for [TERM, HUP, INT]
> 2019-09-12 15:56:36,647 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Maximum number of open file descriptors is 1048576.
> 2019-09-12 15:56:36,710 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.rpc.address, 150.82.218.218
> 2019-09-12 15:56:36,711 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.rpc.port, 6123
> 2019-09-12 15:56:36,712 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.heap.size, 1024m
> 2019-09-12 15:56:36,713 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: taskmanager.heap.size, 1024m
> 2019-09-12 15:56:36,714 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: taskmanager.numberOfTaskSlots, 1
> 2019-09-12 15:56:36,715 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: parallelism.default, 1
> 2019-09-12 15:56:36,717 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.execution.failover-strategy, region
> 2019-09-12 15:56:37,097 INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
> 2019-09-12 15:56:37,221 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
> 2019-09-12 15:56:37,305 INFO  org.apache.flink.runtime.security.SecurityUtils  - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
> 2019-09-12 15:56:38,142 INFO  org.apache.flink.configuration.Configuration  - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
> 2019-09-12 15:56:38,169 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils  - Trying to select the network interface and address to use by connecting to the leading JobManager.
> 2019-09-12 15:56:38,170 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils  - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
> 2019-09-12 15:56:38,185 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Retrieved new target address /150.82.218.218:6123.
> 2019-09-12 15:56:39,691 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address /150.82.218.218:6123
> 2019-09-12 15:56:39,693 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'salman-hpc/127.0.1.1': Invalid argument (connect failed)
> 2019-09-12 15:56:39,696 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/150.82.219.73': No route to host (Host unreachable)
> 2019-09-12 15:56:39,698 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)
> 2019-09-12 15:56:39,748 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/150.82.219.73': connect timed out
> 2019-09-12 15:56:39,750 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/0:0:0:0:0:0:0:1%lo': Network is unreachable (connect failed)
> 2019-09-12 15:56:39,751 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
> 2019-09-12 15:56:39,753 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)
> "flink-komal-taskexecutor-0-salman-hpc.log" 157L, 29954C
>
> I'd appreciate help regarding the issue.
>
> Best Regards,
> Komal
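The "No route to host" and "connect timed out" entries in the log above suggest the TaskManager host cannot open a TCP connection to 150.82.218.218:6123 at all. A minimal sketch of how this could be narrowed down (the firewall checks are assumptions about the environment, not something established in the thread):

    # on the JobManager host (150.82.218.218): is anything listening on 6123?
    ss -tln | grep 6123

    # on the TaskManager host (salman-hpc): can the port be reached directly?
    nc -zv 150.82.218.218 6123

    # on both hosts: check whether a local firewall is filtering the port
    sudo ufw status
    sudo iptables -L -n | grep 6123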
> On Wed, 11 Sep 2019 at 14:13, Komal Mariam <komal.mar...@gmail.com> wrote:
>
>> I'm trying to set up a 3 node Flink cluster (version 1.9) on the
>> following machines:
>>
>> Node 1 (Master): 4 GB (3.8 GB), Core2 Duo 2.80GHz, Ubuntu 16.04 LTS
>> Node 2 (Slave): 16 GB, Core i7-3.40GHz, Ubuntu 16.04 LTS
>> Node 3 (Slave): 16 GB, Core i7-3.40GHz, Ubuntu 16.04 LTS
>>
>> I have followed the instructions on:
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/cluster_setup.html
>>
>> I have defined the IP/address of "jobmanager.rpc.address" in
>> conf/flink-conf.yaml in the following format: master@master-node1-hostname
>>
>> Slaves in conf/slaves:
>> slave@slave-node2-hostname
>> slave@slave-node3-hostname
>> master@master-node1-hostname (using the master machine for task execution too)
>>
>> However, my problem is that when running bin/start-cluster.sh on the Master node,
>> it fails to start the taskexecutor daemon on *both Slave nodes*. It only
>> starts both the taskexecutor daemon and the standalonesession daemon on
>> master@master-node1-hostname (Node 1).
>>
>> I have tried both passwordless ssh and password ssh on all machines but
>> the result is the same. In the latter case, it does ask for the
>> slave@slave-node2-hostname and slave@slave-node3-hostname passwords but
>> fails to display any message like "starting taskexecutor daemon on xxxx"
>> after that.
>>
>> I switched my master node to Node 2 and set Node 1 as a slave. It was able
>> to start taskexecutor daemons on *both Node 2 and Node 3* successfully
>> but did nothing for Node 1.
>>
>> I'd appreciate it if you could advise on what the problem here could be and
>> how I can resolve it.
>>
>> Best Regards,
>> Komal
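For reference, a minimal sketch of how the two configuration files described in the Sep 11 message typically look for a standalone Flink 1.9 cluster (the hostnames are the placeholders from the thread; exact values are assumptions):

    # conf/flink-conf.yaml (identical on all three nodes)
    jobmanager.rpc.address: master-node1-hostname
    jobmanager.rpc.port: 6123

    # conf/slaves (on the master; one worker hostname per line, no user@ prefix)
    master-node1-hostname
    slave-node2-hostname
    slave-node3-hostname

bin/start-cluster.sh connects over SSH to every host listed in conf/slaves as the user who invokes the script, which is why having the same username (and passwordless SSH for it) on all nodes matters here.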