Re: Problem starting taskexecutor daemons in 3 node cluster
SSH access to the nodes and the nodes being able to talk to each other are separate issues. The former is only used for starting the Flink cluster. Once the cluster is started, Flink only requires that the nodes can talk to each other, independent of SSH.

Cheers,
Till

On Tue, Sep 17, 2019 at 7:39 AM Komal Mariam wrote:
> For now, all my nodes have password-enabled SSH. Do you think this issue
> could be because I have not set up passwordless SSH?
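Till's point above, that SSH is only needed to launch the daemons while the running cluster needs plain TCP reachability, can be checked from each worker without SSH at all. A minimal sketch (the IP and port are the `jobmanager.rpc.address`/`jobmanager.rpc.port` values from this thread; substitute your own):

```python
# Quick TCP reachability check for the Flink JobManager RPC port,
# independent of SSH. Run this on every worker node.
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unroutable hosts, etc.
        return False

if __name__ == "__main__":
    # Values taken from this thread's flink-conf.yaml; adjust for your cluster.
    host, port = "150.82.218.218", 6123
    print(f"{host}:{port} reachable: {can_reach(host, port, timeout=3.0)}")
```

If this prints `False` on a worker, the TaskManager on that node will not be able to register with the JobManager regardless of how SSH is configured.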
Re: Problem starting taskexecutor daemons in 3 node cluster
Hi Till,

Thank you for the reply. I tried to SSH between the nodes individually, and they can all connect to each other. It's just that, for some reason, the other worker nodes cannot connect to the JobManager on 150.82.218.218:6123 (Node 1).

I got around the problem by putting the master node (JobManager) on Node 2 and making 150.82.218.218 a slave (TaskManager). Now all nodes, including 150.82.218.218, show up in the new JobManager's UI, and I can see my jobs being distributed between them too.

For now, all my nodes have password-enabled SSH. Do you think this issue could be because I have not set up passwordless SSH? If start-cluster.sh can start the nodes over password SSH, why is it important to set up passwordless SSH (aside from convenience)?

Best Regards,
Komal

On Fri, 13 Sep 2019 at 18:31, Till Rohrmann wrote:
> could you check that every node can reach the other nodes? It looks a
> little bit as if the TaskManager cannot talk to the JobManager running on
> 150.82.218.218:6123.
Re: Problem starting taskexecutor daemons in 3 node cluster
Hi Komal,

Could you check that every node can reach the other nodes? It looks a little bit as if the TaskManager cannot talk to the JobManager running on 150.82.218.218:6123.

Cheers,
Till

On Thu, Sep 12, 2019 at 9:30 AM Komal Mariam wrote:
> However, for some reason my worker nodes don't show up as available
> task managers in the web UI.
Re: Problem starting taskexecutor daemons in 3 node cluster
I managed to fix it, but I ran into another problem that I would appreciate help in resolving.

It turns out that the username for all three nodes was different; using the same username on all of them fixed the issue, i.e.:

same_username@slave-node2-hostname
same_username@slave-node3-hostname
same_username@master-node1-hostname

In fact, because the usernames are the same, I can just save them in the conf files as:

slave-node2-hostname
slave-node3-hostname
master-node1-hostname

However, for some reason my worker nodes don't show up as available task managers in the web UI.

The taskexecutor log says the following:

... (clipped for brevity)
2019-09-12 15:56:36,625 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      -
2019-09-12 15:56:36,631 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-09-12 15:56:36,647 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      - Maximum number of open file descriptors is 1048576.
2019-09-12 15:56:36,710 INFO  org.apache.flink.configuration.GlobalConfiguration           - Loading configuration property: jobmanager.rpc.address, 150.82.218.218
2019-09-12 15:56:36,711 INFO  org.apache.flink.configuration.GlobalConfiguration           - Loading configuration property: jobmanager.rpc.port, 6123
2019-09-12 15:56:36,712 INFO  org.apache.flink.configuration.GlobalConfiguration           - Loading configuration property: jobmanager.heap.size, 1024m
2019-09-12 15:56:36,713 INFO  org.apache.flink.configuration.GlobalConfiguration           - Loading configuration property: taskmanager.heap.size, 1024m
2019-09-12 15:56:36,714 INFO  org.apache.flink.configuration.GlobalConfiguration           - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-09-12 15:56:36,715 INFO  org.apache.flink.configuration.GlobalConfiguration           - Loading configuration property: parallelism.default, 1
2019-09-12 15:56:36,717 INFO  org.apache.flink.configuration.GlobalConfiguration           - Loading configuration property: jobmanager.execution.failover-strategy, region
2019-09-12 15:56:37,097 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-09-12 15:56:37,221 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-09-12 15:56:37,305 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-09-12 15:56:38,142 INFO  org.apache.flink.configuration.Configuration                  - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2019-09-12 15:56:38,169 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to select the network interface and address to use by connecting to the leading JobManager.
2019-09-12 15:56:38,170 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager will try to connect for 1 milliseconds before falling back to heuristics
2019-09-12 15:56:38,185 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Retrieved new target address /150.82.218.218:6123.
2019-09-12 15:56:39,691 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Trying to connect to address /150.82.218.218:6123
2019-09-12 15:56:39,693 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address 'salman-hpc/127.0.1.1': Invalid argument (connect failed)
2019-09-12 15:56:39,696 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/150.82.219.73': No route to host (Host unreachable)
2019-09-12 15:56:39,698 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)
2019-09-12 15:56:39,748 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/150.82.219.73': connect timed out
2019-09-12 15:56:39,750 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%lo': Network is unreachable (connect failed)
2019-09-12 15:56:39,751 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2019-09-12 15:56:39,753 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/fe80:0:0:0:1e10:83f4:a33a:a208%enp5s0f1': Network is unreachable (connect failed)

"flink-komal-taskexecutor-0-salman-hpc.log" 157L, 29954C

I'd
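For reference, the host lists described in the message above would look roughly like this in Flink's conf directory. This is a sketch using the placeholder hostnames from this thread; `conf/slaves` is the workers file name in Flink 1.9-era releases, and the `host:port` line in `conf/masters` points at the JobManager's web UI port:

```
# conf/masters -- the node running the JobManager
master-node1-hostname:8081

# conf/slaves -- one TaskManager host per line
slave-node2-hostname
slave-node3-hostname

# conf/flink-conf.yaml -- every worker must be able to resolve and reach this address
jobmanager.rpc.address: master-node1-hostname
jobmanager.rpc.port: 6123
```

Because the SSH username is the same on all nodes, plain hostnames suffice here; otherwise each line would need the `username@hostname` form.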