Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete Mon, 08 Feb 2016 10:45:04 -0800

Solved: Flink indeed needed to be built for YARN 2.7.1 specifically. Cheers!
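
For anyone hitting the same rm1/rm2 failover loop, here is a minimal sketch of such a build. It assumes a checkout of the Flink source tree and the hadoop.version Maven property described in Flink's build instructions; the version number should match what the cluster actually runs (2.7.1 in this thread).

```shell
# Run from the root of the Flink source checkout.
# -Dhadoop.version selects the Hadoop/YARN version Flink is built against;
# 2.7.1 is the cluster version from this thread, so adjust it to your own cluster.
mvn clean install -DskipTests -Dhadoop.version=2.7.1
```

Running `yarn version` on a cluster node should print the version to pass in.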

2016-02-08 19:13 GMT+01:00 Robert Metzger <rmetz...@apache.org>:


> Mh, that's weird. Maybe both resource managers are marked as "standby"?
> Not sure what can cause this issue.
>
> Which YARN version are you using? Maybe you need to build Flink against
> that specific hadoop version yourself.
>
> On Mon, Feb 8, 2016 at 5:50 PM, Pieter Hameete <phame...@gmail.com> wrote:
>
>> After downloading and building the 1.0-SNAPSHOT from the master branch I
>> do run into another problem when starting a YARN cluster. The startup now
>> infinitely loops at the following step:
>>
>> 17:39:12,369 INFO
>> org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider  - Failing
>> over to rm2
>> 17:39:34,855 INFO
>> org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider  - Failing
>> over to rm1
>>
>> Any clue what could've gone wrong? I used all the defaults when building
>> with Maven.
>>
>> - Pieter
>>
>>
>>
>> 2016-02-08 17:07 GMT+01:00 Pieter Hameete <phame...@gmail.com>:
>>
>>> Matter of RTFM eh ;-) thx and sorry for the bother.
>>>
>>> 2016-02-08 17:06 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>>>
>>>> You said earlier that you are using Flink 0.10. The feature is only
>>>> available in 1.0-SNAPSHOT.
>>>>
>>>> On Mon, Feb 8, 2016 at 4:53 PM, Pieter Hameete <phame...@gmail.com>
>>>> wrote:
>>>>
>>>>> I've tried setting the yarn.application-master.port property in
>>>>> flink-conf.yaml to a range suggested in
>>>>> https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html#running-flink-on-yarn-behind-firewalls
>>>>>
>>>>> The JobManager does not seem to be picking the property up. Am I
>>>>> setting this in the wrong place? Or is there another way to enforce this
>>>>> property?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Pieter
>>>>>
>>>>> 2016-02-07 20:04 GMT+01:00 Pieter Hameete <phame...@gmail.com>:
>>>>>
>>>>>> I found the relevant information on the website. I'll consult with the
>>>>>> cluster admin tomorrow, thanks for the help :-)
>>>>>>
>>>>>> - Pieter
>>>>>>
>>>>>> 2016-02-07 19:31 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> we had other users with a similar issue as well. There is a
>>>>>>> configuration value which allows you to specify a single port or a
>>>>>>> range of ports for the JobManager to allocate when running on YARN.
>>>>>>> Note that when using this with a single port, the JMs may collide.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Feb 7, 2016 at 7:25 PM, Pieter Hameete <phame...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Stephan,
>>>>>>>>
>>>>>>>> surely it seems this way! I must not be the first with this issue
>>>>>>>> though? I'll have to contact the cluster admins to find a solution
>>>>>>>> together. What would be a way to make the JobManagers accessible from
>>>>>>>> outside the network, given that the IP and port number change every time?
>>>>>>>>
>>>>>>>> Alternatively, I can ask for ssh access to a node within the
>>>>>>>> network. That will surely work but it's not my preferred solution.
>>>>>>>>
>>>>>>>> - Pieter
>>>>>>>>
>>>>>>>> 2016-02-06 16:22 GMT+01:00 Stephan Ewen <se...@apache.org>:
>>>>>>>>
>>>>>>>>> Yeah, sounds a lot like the client cannot connect to the
>>>>>>>>> JobManager port.
>>>>>>>>>
>>>>>>>>> The ports to communicate with HDFS and the YARN resource manager
>>>>>>>>> may be whitelisted or forwarded, so you can submit the YARN session,
>>>>>>>>> but then not connect to the JobManager afterwards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Feb 6, 2016 at 2:11 PM, Pieter Hameete <phame...@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hi Max!
>>>>>>>>>>
>>>>>>>>>> I'm using Flink 0.10.1 and indeed the cluster seems to be created
>>>>>>>>>> fine, all in the JobManager Web UI looks good.
>>>>>>>>>>
>>>>>>>>>> It seems like the JobManager initiates the connection with my VM
>>>>>>>>>> and cannot reach it. It could be that this is similar to the problem 
>>>>>>>>>> here:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-with-docker-errors-with-akka-NAT-td7702.html
>>>>>>>>>>
>>>>>>>>>> I probably have to make some changes to the networking
>>>>>>>>>> configuration of my VM so it can be reached by the JobManager 
>>>>>>>>>> despite using
>>>>>>>>>> a different port each time.
>>>>>>>>>>
>>>>>>>>>> - Pieter
>>>>>>>>>>
>>>>>>>>>> 2016-02-06 14:05 GMT+01:00 Maximilian Michels <m...@apache.org>:
>>>>>>>>>>
>>>>>>>>>>> Hi Pieter,
>>>>>>>>>>>
>>>>>>>>>>> Which version of Flink are you using? It appears you've created a
>>>>>>>>>>> Flink YARN cluster but you can't reach the JobManager afterwards.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Max
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 6, 2016 at 1:42 PM, Pieter Hameete <
>>>>>>>>>>> phame...@gmail.com> wrote:
>>>>>>>>>>> > Hi Robert,
>>>>>>>>>>> >
>>>>>>>>>>> > unfortunately there are no signs of what is going wrong in the
>>>>>>>>>>> logs. The
>>>>>>>>>>> > last log messages are about successful registration of the
>>>>>>>>>>> TaskManagers.
>>>>>>>>>>> >
>>>>>>>>>>> > I'm also fairly sure it must be something in my VM that is
>>>>>>>>>>> causing this,
>>>>>>>>>>> > because when I start the yarn-session from a login node that
>>>>>>>>>>> is on the same
>>>>>>>>>>> > network as the hadoop cluster there are no problems
>>>>>>>>>>> registering with the
>>>>>>>>>>> > JobManager. I did also notice the following message in the
>>>>>>>>>>> local console:
>>>>>>>>>>> >
>>>>>>>>>>> > 12:30:27,173 WARN  Remoting
>>>>>>>>>>> > - Tried to associate with unreachable remote address
>>>>>>>>>>> > [akka.tcp://flink@145.100.41.13:41539]. Address is now gated
>>>>>>>>>>> for 5000 ms,
>>>>>>>>>>> > all messages to this address will be delivered to dead
>>>>>>>>>>> letters. Reason:
>>>>>>>>>>> > connection timed out: /145.100.41.13:41539
>>>>>>>>>>> >
>>>>>>>>>>> > I can ping the JobManager fine from within the VM. Could there be
>>>>>>>>>>> some invalid or
>>>>>>>>>>> > missing configuration on my side?
>>>>>>>>>>> >
>>>>>>>>>>> > Cheers,
>>>>>>>>>>> >
>>>>>>>>>>> > Pieter
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > 2016-02-06 12:54 GMT+01:00 Robert Metzger <rmetz...@apache.org
>>>>>>>>>>> >:
>>>>>>>>>>> >>
>>>>>>>>>>> >> Hi,
>>>>>>>>>>> >>
>>>>>>>>>>> >> did you check the logs of the JobManager itself? Maybe it'll
>>>>>>>>>>> tell us
>>>>>>>>>>> >> already what's going on.
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Sat, Feb 6, 2016 at 12:14 PM, Pieter Hameete <
>>>>>>>>>>> phame...@gmail.com>
>>>>>>>>>>> >> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Hi Guys!
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I'm attempting to run Flink on YARN, but I run into an issue.
>>>>>>>>>>> I'm starting
>>>>>>>>>>> >>> the Flink YARN session from an Ubuntu 14.04 VM. All goes
>>>>>>>>>>> well until after
>>>>>>>>>>> >>> the JobManager web UI is started:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> JobManager web interface address
>>>>>>>>>>> >>>
>>>>>>>>>>> http://head05.hathi.surfsara.nl:8088/proxy/application_1452780322684_10532/
>>>>>>>>>>> >>> Waiting until all TaskManagers have connected
>>>>>>>>>>> >>> 11:09:51,557 INFO  org.apache.flink.yarn.ApplicationClient
>>>>>>>>>>> >>> - Notification about new leader address
>>>>>>>>>>> >>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with
>>>>>>>>>>> session ID null.
>>>>>>>>>>> >>> No status updates from the YARN cluster received so far.
>>>>>>>>>>> Waiting ...
>>>>>>>>>>> >>> 11:09:51,578 INFO  org.apache.flink.yarn.ApplicationClient
>>>>>>>>>>> >>> - Received address of new leader
>>>>>>>>>>> >>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with
>>>>>>>>>>> session ID null.
>>>>>>>>>>> >>> 11:09:51,583 INFO  org.apache.flink.yarn.ApplicationClient
>>>>>>>>>>> >>> - Disconnect from JobManager null.
>>>>>>>>>>> >>> 11:09:51,595 INFO  org.apache.flink.yarn.ApplicationClient
>>>>>>>>>>> >>> - Trying to register at JobManager
>>>>>>>>>>> >>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager.
>>>>>>>>>>> >>> No status updates from the YARN cluster received so far.
>>>>>>>>>>> Waiting ...
>>>>>>>>>>> >>> No status updates from the YARN cluster received so far.
>>>>>>>>>>> Waiting ...
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> It then hangs on these last steps (trying to register, no
>>>>>>>>>>> status
>>>>>>>>>>> >>> updates..)
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I'm sure there must be a problem on my side that is causing
>>>>>>>>>>> me not to be
>>>>>>>>>>> >>> able to register at the JobManager. What could cause such
>>>>>>>>>>> connection
>>>>>>>>>>> >>> problems?
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Any tips are very welcome :-)
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Cheers and have a good weekend!
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> - Pieter
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>
>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
