Re: Unable to start Flink HA cluster with Zookeeper

2018-08-22 Thread mozer
Thanks for the info. I have managed to launch an HA cluster by adding
jobmanager.rpc.address for all job managers.
But it did not work with start-cluster.sh; I had to add them one by one.





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Unable to start Flink HA cluster with Zookeeper

2018-08-21 Thread mozer
Yeah, you are right. I have already tried setting jobmanager.rpc.address and
it works in that case, but if I use this setting I will not be able to use
HA, am I right?
How can the job manager register with ZooKeeper using the right address
instead of localhost?
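
One way to narrow this down is to check what each machine's own host name resolves to, since a JobManager that binds to 127.0.0.1 will also advertise 127.0.0.1. The following is only a sketch: `is_loopback`, `resolve_ipv4` and `report` are hypothetical helper names, not part of Flink, and it uses nothing beyond Python's standard `socket` and `ipaddress` modules.

```python
import ipaddress
import socket
from typing import List


def is_loopback(address: str) -> bool:
    """Return True if the given IP address literal is a loopback address."""
    return ipaddress.ip_address(address).is_loopback


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the distinct IPv4 addresses that a host name resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def report(hostname: str) -> None:
    """Print every resolved address and mark the loopback ones."""
    for address in resolve_ipv4(hostname):
        flag = "loopback" if is_loopback(address) else "ok"
        print(f"{hostname} -> {address} ({flag})")
```

Calling `report(socket.gethostname())` on each machine prints one line per resolved address. If the job manager's own name comes back as a loopback address, that would be consistent with the 127.0.0.1 addresses in the logs.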







Re: Unable to start Flink HA cluster with Zookeeper

2018-08-21 Thread mozer
FQDN or full IP: I tried all of them, still no change.
As for SSH, I can connect to each machine without a password.


Do you think the problem could come from:

*high-availability.storageDir: file:///shareflink/recovery* ?

I don't use HDFS storage but a NAS file system that is shared by the two
machines.

I also added:


state.backend: filesystem
state.checkpoints.fs.dir: file:///shareflink/recovery/checkpoint
blob.storage.directory: file:///shareflink/recovery/blob

ZooKeeper log file:

2018-08-21 14:59:32,652 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - tickTime set to 2000
2018-08-21 14:59:32,653 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - minSessionTimeout set to -1
2018-08-21 14:59:32,653 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - maxSessionTimeout set to -1
2018-08-21 14:59:32,661 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory - binding to port 0.0.0.0/0.0.0.0:2181
2018-08-21 14:59:39,940 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /Machine1:60186
2018-08-21 14:59:40,015 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /Machine2:54466
2018-08-21 14:59:40,017 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /Machine1:60186
2018-08-21 14:59:40,017 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /Machine2:54466

Log for Job Manager:

2018-08-21 14:59:39,327 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at 127.0.0.1:50101
2018-08-21 14:59:39,723 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-08-21 14:59:39,766 INFO  akka.remote.Remoting - Starting remoting
2018-08-21 14:59:39,859 INFO  akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@127.0.0.1:50101]
2018-08-21 14:59:39,865 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink@127.0.0.1:50101
2018-08-21 14:59:39,872 INFO  org.apache.flink.runtime.blob.FileSystemBlobStore - Creating highly available BLOB storage directory at file:///shareflink/recovery///blob
2018-08-21 14:59:39,876 INFO  org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing default ACL for ZK connections
2018-08-21 14:59:39,876 INFO  org.apache.flink.runtime.util.ZooKeeperUtils - Using '/usr/flink-1.5.1/' as Zookeeper namespace.
2018-08-21 14:59:39,919 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
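
Since the key symptom in these logs is an actor system address on 127.0.0.1, a small scan of the log text can flag it quickly. This is a minimal sketch, assuming the `akka.tcp://flink@host:port` URL form seen in the log lines above; `loopback_endpoints` is a hypothetical helper, not a Flink tool.

```python
import ipaddress
import re
from typing import List

# Matches the host and port in URLs such as akka.tcp://flink@127.0.0.1:50101
AKKA_ADDRESS = re.compile(r"akka\.tcp://flink@([^:/\]\s]+):(\d+)")


def loopback_endpoints(log_text: str) -> List[str]:
    """Return the distinct host:port pairs whose host is loopback or 'localhost'."""
    found: List[str] = []
    for host, port in AKKA_ADDRESS.findall(log_text):
        try:
            loopback = ipaddress.ip_address(host).is_loopback
        except ValueError:
            # Not an IP literal: only the literal name "localhost" is flagged
            loopback = host == "localhost"
        endpoint = f"{host}:{port}"
        if loopback and endpoint not in found:
            found.append(endpoint)
    return found
```

Passing the job manager log text to `loopback_endpoints` returns the loopback endpoints it advertises. A non-empty result means remote task managers cannot reach that address.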







Unable to start Flink HA cluster with Zookeeper

2018-08-21 Thread mozer
I am trying to set up a Flink HA cluster (ZooKeeper mode), but the task
manager cannot find the job manager.

Here is the architecture:

- Machine 1 : Job Manager + Zookeeper
- Machine 2 : Task Manager

masters: 

Machine1

slaves : 

Machine2

flink-conf.yaml: 

#jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
blob.server.port: 50100-50200
taskmanager.data.port: 6121
high-availability: zookeeper
high-availability.zookeeper.quorum: Machine1:2181
high-availability.zookeeper.path.root: /flink-1.5.1
high-availability.cluster-id: /default_b
high-availability.storageDir: file:///shareflink/recovery

Here is the Task Manager log; it tries to connect to localhost
instead of Machine1:

2018-08-17 10:46:44,875 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-08-17 10:46:44,876 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 1 milliseconds before falling back to heuristics
2018-08-17 10:46:44,966 INFO  org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /127.0.0.1:37133.
2018-08-17 10:46:45,324 INFO  org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address /127.0.0.1:37133
2018-08-17 10:46:45,325 INFO  org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address 'Machine2/IP-Machine2': Connection refused
2018-08-17 10:46:45,325 INFO  org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Connection refused
2018-08-17 10:46:45,325 INFO  org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/IP_Machine2': Connection refused
2018-08-17 10:46:45,325 INFO  org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Connection refused
2018-08-17 10:46:45,326 INFO  org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/IP_Machine2': Connection refused
2018-08-17 10:46:45,326 INFO  org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Connection refused
2018-08-17 10:46:45,726 INFO  org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address /127.0.0.1:37133
2018-08-17 10:46:45,727 INFO  org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address 'Machine2/IP-Machine2

2018-08-17 10:47:22,022 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@127.0.0.1:36515] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:36515]] Caused by: [Connection refused: /127.0.0.1:36515]

2018-08-17 10:47:22,022 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@127.0.0.1:36515/user/resourcemanager, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@127.0.0.1:36515/user/resourcemanager..
2018-08-17 10:47:32,037 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /127.0.0.1:36515



PS: **/etc/hosts** contains entries for **localhost, Machine1 and Machine2**.
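
Because the hosts file is involved, it may be worth checking whether Machine1 is listed on a loopback line there, which is a common reason for a service to advertise 127.0.0.1. The sketch below is only an illustration under that assumption: `loopback_names` is a hypothetical helper, and the sample addresses in the usage note are made up, not taken from this cluster.

```python
import ipaddress
from typing import Dict, List


def loopback_names(hosts_text: str) -> Dict[str, List[str]]:
    """Map each loopback address in hosts-file text to the names listed for it."""
    result: Dict[str, List[str]] = {}
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        address, *names = entry.split()
        try:
            if ipaddress.ip_address(address).is_loopback:
                result.setdefault(address, []).extend(names)
        except ValueError:
            # First field is not a plain IP literal; skip the line
            continue
    return result
```

For example, given the made-up lines `127.0.0.1 localhost` and `127.0.1.1 Machine1`, the function returns both names under their loopback addresses. On a real machine the input would be the contents of /etc/hosts, and any cluster host name in the result would need a routable address instead.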


Can you please tell me how the Task Manager can connect to the Job Manager?

Regards




