Hi Komal,

could you check that every node can reach the other nodes? It looks a little bit as if the TaskManager cannot talk to the JobManager running on 150.82.218.218:6123.

Cheers,
Till
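A minimal sketch of such a reachability check, run from each of the three nodes (the hostnames are the placeholders used later in the thread, and ping/nc being available is an assumption, not something stated in the thread):

    # run this on every node: the other nodes should answer
    for host in master-node1-hostname slave-node2-hostname slave-node3-hostname; do
        ping -c 1 "$host"
    done

    # run this on every node: TCP connect to the JobManager RPC port from the thread
    nc -zv 150.82.218.218 6123

If the port check succeeds from every node, plain network connectivity is not the problem and the focus can shift to the Flink configuration itself.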
On Thu, Sep 12, 2019 at 9:30 AM Komal Mariam <komal.mar...@gmail.com> wrote:

> I managed to fix it, however I ran into another problem that I would
> appreciate help in resolving.
>
> It turns out that the username for all three nodes was different. Having
> the same username for them fixed the issue, i.e.:
> same_username@slave-node2-hostname
> same_username@slave-node3-hostname
> same_username@master-node1-hostname
>
> In fact, because the usernames are the same, I can just save them in the
> conf files as:
> slave-node2-hostname
> slave-node3-hostname
> master-node1-hostname
>
> However, for some reason my worker nodes don't show up as available
> task managers in the web UI.
>
> The taskexecutor log says the following:
> ... (clipped for brevity)
> 2019-09-12 15:56:36,625 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - --------------------------------------------------------------------------------
> 2019-09-12 15:56:36,631 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Registered UNIX signal handlers for [TERM, HUP, INT]
> 2019-09-12 15:56:36,647 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Maximum number of open file descriptors is 1048576.
> 2019-09-12 15:56:36,710 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.rpc.address, 150.82.218.218
> 2019-09-12 15:56:36,711 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.rpc.port, 6123
> 2019-09-12 15:56:36,712 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.heap.size, 1024m
> 2019-09-12 15:56:36,713 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: taskmanager.heap.size, 1024m
> 2019-09-12 15:56:36,714 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: taskmanager.numberOfTaskSlots, 1
> 2019-09-12 15:56:36,715 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: parallelism.default, 1
> 2019-09-12 15:56:36,717 INFO  org.apache.flink.configuration.GlobalConfiguration  - Loading configuration property: jobmanager.execution.failover-strategy, region
> 2019-09-12 15:56:37,097 INFO  org.apache.flink.core.fs.FileSystem  - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
> 2019-09-12 15:56:37,221 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
> 2019-09-12 15:56:37,305 INFO  org.apache.flink.runtime.security.SecurityUtils  - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
> 2019-09-12 15:56:38,142 INFO  org.apache.flink.configuration.Configuration  - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
> 2019-09-12 15:56:38,169 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils  - Trying to select the network interface and address to use by connecting to the leading JobManager.
> 2019-09-12 15:56:38,170 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils  - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
> 2019-09-12 15:56:38,185 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Retrieved new target address /150.82.218.218:6123.
> 2019-09-12 15:56:39,691 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to address /150.82.218.218:6123
> 2019-09-12 15:56:39,693 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address 'salman-hpc/127.0.1.1': Invalid argument (connect failed)
> 2019-09-12 15:56:39,696 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/150.82.219.73': No route to host (Host unreachable)
> 2019-09-12 15:56:39,698 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)
> 2019-09-12 15:56:39,748 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/150.82.219.73': connect timed out
> 2019-09-12 15:56:39,750 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/0:0:0:0:0:0:0:1%lo': Network is unreachable (connect failed)
> 2019-09-12 15:56:39,751 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
> 2019-09-12 15:56:39,753 INFO  org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)
> "flink-komal-taskexecutor-0-salman-hpc.log" 157L, 29954C
>
> I'd appreciate help regarding the issue.
>
> Best Regards,
> Komal
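The "No route to host" and "connect timed out" entries in the log above suggest the TaskManager host cannot open a TCP connection to 150.82.218.218:6123 at all. A minimal sketch of how this could be narrowed down (the firewall checks are assumptions about the environment, not something established in the thread):

    # on the JobManager host (150.82.218.218): is anything listening on 6123?
    ss -tln | grep 6123

    # on the TaskManager host (salman-hpc): can the port be reached directly?
    nc -zv 150.82.218.218 6123

    # on both hosts: check whether a local firewall is filtering the port
    sudo ufw status
    sudo iptables -L -n | grep 6123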
> On Wed, 11 Sep 2019 at 14:13, Komal Mariam <komal.mar...@gmail.com> wrote:
>
>> I'm trying to set up a 3 node Flink cluster (version 1.9) on the
>> following machines:
>>
>> Node 1 (Master): 4 GB (3.8 GB), Core2 Duo 2.80GHz, Ubuntu 16.04 LTS
>> Node 2 (Slave): 16 GB, Core i7-3.40GHz, Ubuntu 16.04 LTS
>> Node 3 (Slave): 16 GB, Core i7-3.40GHz, Ubuntu 16.04 LTS
>>
>> I have followed the instructions on:
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/cluster_setup.html
>>
>> I have defined the IP/address of "jobmanager.rpc.address" in
>> conf/flink-conf.yaml in the following format: master@master-node1-hostname
>>
>> Slaves in conf/slaves:
>> slave@slave-node2-hostname
>> slave@slave-node3-hostname
>> master@master-node1-hostname (using the master machine for task execution too)
>>
>> However, my problem is that when running bin/start-cluster.sh on the Master node,
>> it fails to start the taskexecutor daemon on *both Slave nodes*. It only
>> starts both the taskexecutor daemon and the standalonesession daemon on
>> master@master-node1-hostname (Node 1).
>>
>> I have tried both passwordless ssh and password ssh on all machines but
>> the result is the same. In the latter case, it does ask for the
>> slave@slave-node2-hostname and slave@slave-node3-hostname passwords but
>> fails to display any message like "starting taskexecutor daemon on xxxx"
>> after that.
>>
>> I switched my master node to Node 2 and set Node 1 as a slave. It was able
>> to start taskexecutor daemons on *both Node 2 and Node 3* successfully
>> but did nothing for Node 1.
>>
>> I'd appreciate it if you could advise on what the problem here could be and
>> how I can resolve it.
>>
>> Best Regards,
>> Komal
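For reference, a minimal sketch of how the two configuration files described in the Sep 11 message typically look for a standalone Flink 1.9 cluster (the hostnames are the placeholders from the thread; exact values are assumptions):

    # conf/flink-conf.yaml (identical on all three nodes)
    jobmanager.rpc.address: master-node1-hostname
    jobmanager.rpc.port: 6123

    # conf/slaves (on the master; one worker hostname per line, no user@ prefix)
    master-node1-hostname
    slave-node2-hostname
    slave-node3-hostname

bin/start-cluster.sh connects over SSH to every host listed in conf/slaves as the user who invokes the script, which is why having the same username (and passwordless SSH for it) on all nodes matters here.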