Re: TaskManager not connecting to ResourceManager in HA mode

2019-08-22 Thread Zili Chen
Nice to hear :-)

Best,
tison.



Re: TaskManager not connecting to ResourceManager in HA mode

2019-08-22 Thread Aleksandar Mastilovic
Thanks for all the help, people - you made me go through my code once again and
discover that I had swapped the argument positions of the job manager and
resource manager addresses :-)

The docker ensemble now starts fine; I’m working on ironing out the remaining bugs.

I’ll participate in the survey too!
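
A note for readers who hit the same symptom: when the job manager and resource
manager addresses are both passed around as plain strings, the compiler cannot
tell them apart, so a swap like the one described above goes unnoticed until
runtime. Below is a minimal sketch, not Flink code, of one way to guard against
that by giving each address its own small type. The class names
ResourceManagerAddress, JobManagerAddress and HaAddresses are hypothetical, and
the /user/resourcemanager suffix check assumes the endpoint name shown in the
jobmanager log in this thread.

```java
// Hypothetical wrapper types for illustration; none of these are Flink classes.
final class ResourceManagerAddress {
    final String value;

    ResourceManagerAddress(String value) {
        // Assumption: the ResourceManager endpoint is named "resourcemanager",
        // as in the "granted leadership" log line quoted in this thread.
        if (!value.endsWith("/user/resourcemanager")) {
            throw new IllegalArgumentException(
                    "Not a ResourceManager address: " + value);
        }
        this.value = value;
    }
}

final class JobManagerAddress {
    final String value;

    JobManagerAddress(String value) {
        // No suffix check here: the thread does not confirm the endpoint name.
        this.value = value;
    }
}

// Holder whose constructor cannot be called with the two addresses swapped,
// because the parameter types differ.
final class HaAddresses {
    final ResourceManagerAddress resourceManager;
    final JobManagerAddress jobManager;

    HaAddresses(ResourceManagerAddress resourceManager, JobManagerAddress jobManager) {
        this.resourceManager = resourceManager;
        this.jobManager = jobManager;
    }
}
```

With this shape, passing the two wrapped addresses to HaAddresses in the wrong
order does not compile, and building a ResourceManagerAddress from a
.../user/jobmanager string fails fast with an IllegalArgumentException.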


Re: TaskManager not connecting to ResourceManager in HA mode

2019-08-21 Thread Zili Chen
Besides, would you like to participate in our survey thread [1] on the
user list, "How do you use high-availability services in Flink?"

It would help Flink improve its high-availability services.

Best,
tison.

[1]
https://lists.apache.org/x/thread.html/c0cc07197e6ba30b45d7709cc9e17d8497e5e3f33de504d58dfcafad@%3Cuser.flink.apache.org%3E




Re: TaskManager not connecting to ResourceManager in HA mode

2019-08-21 Thread Zili Chen
Hi Aleksandar,

Based on your log:

taskmanager_1   | 2019-08-22 00:05:03,713 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager().
taskmanager_1   | 2019-08-22 00:05:04,137 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..

It looks like the retrieval service for the resource manager is returning a
JobManager address. Please check the implementation carefully, or share it on
the mailing list so that others can help investigate.

Best,
tison.
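
The check suggested here can also be automated. The sketch below assumes, based
only on the logs in this thread, that the ResourceManager endpoint is named
resourcemanager, and compares that name with the last path segment of whatever
address the retrieval code hands back. RpcEndpointCheck and its methods are
hypothetical helper names, not part of Flink.

```java
// Hypothetical helper for illustration; not a Flink class.
final class RpcEndpointCheck {

    /** Returns the text after the last '/' of an RPC address. */
    static String endpointName(String rpcAddress) {
        int slash = rpcAddress.lastIndexOf('/');
        return slash < 0 ? rpcAddress : rpcAddress.substring(slash + 1);
    }

    /** Assumption: the ResourceManager endpoint is named "resourcemanager". */
    static boolean looksLikeResourceManager(String rpcAddress) {
        return "resourcemanager".equals(endpointName(rpcAddress));
    }

    public static void main(String[] args) {
        // Address the TaskExecutor tried to connect to, per the log above.
        String fromTaskManagerLog = "akka.tcp://flink@jobmanager:6123/user/jobmanager";
        // Address the ResourceManager reported when it was granted leadership.
        String fromJobManagerLog = "akka.tcp://flink@jobmanager:6123/user/resourcemanager";

        System.out.println(looksLikeResourceManager(fromTaskManagerLog)); // false
        System.out.println(looksLikeResourceManager(fromJobManagerLog));  // true
    }
}
```

Running a check like this wherever a custom HA implementation publishes or
returns the resource manager address would surface a mixed-up address at
startup instead of in the TaskManager retry loop.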




Re: TaskManager not connecting to ResourceManager in HA mode

2019-08-21 Thread Zhu Zhu
Hi Aleksandar,

The resource manager address is retrieved from the HA services.
Could you check whether your custom HA services implementation returns the
right LeaderRetrievalService, and whether that LeaderRetrievalService
actually retrieves the right leader's address?
Or is it possible that the resource manager address stored in HA gets
overwritten with the jobmanager address at some point?

Thanks,
Zhu Zhu

On Thu, Aug 22, 2019 at 8:16 AM, Aleksandar Mastilovic wrote:

> Hi all,
>
> I’m experimenting with my own implementation of HA services that would
> persist JobManager information on a Kubernetes volume instead of in
> ZooKeeper.
>
> I’ve set the high-availability option in flink-conf.yaml to the FQN of my
> factory class, and started the docker ensemble as I usually do (i.e. with
> no special “cluster” arguments or scripts.)
>
> What’s happening now is that TaskManager is unable to connect to
> ResourceManager, because it seems it’s trying to use the /user/jobmanager
> path instead of /user/resourcemanager.
>
> Here’s what I found in the logs:
>
>
> jobmanager_1    | 2019-08-22 00:05:00,963 INFO  akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@jobmanager:6123]
> jobmanager_1    | 2019-08-22 00:05:00,975 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink@jobmanager:6123
>
> jobmanager_1    | 2019-08-22 00:05:02,380 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
>
> jobmanager_1    | 2019-08-22 00:05:03,138 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
>
> jobmanager_1    | 2019-08-22 00:05:03,211 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@jobmanager:6123/user/resourcemanager was granted leadership with fencing token
>
> jobmanager_1    | 2019-08-22 00:05:03,292 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@jobmanager:6123/user/dispatcher was granted leadership with fencing token ----
>
> taskmanager_1   | 2019-08-22 00:05:03,713 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager().
> taskmanager_1   | 2019-08-22 00:05:04,137 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..
>
> Is this a known bug? I’d appreciate any help I can get.
>
> Thanks,
> Aleksandar Mastilovic
>