Solved: indeed, it needed to be built for YARN 2.7.1 specifically. Cheers!

2016-02-08 19:13 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
> Mh, that's weird. Maybe both resource managers are marked as "standby"?
> Not sure what can cause this issue.
>
> Which YARN version are you using? Maybe you need to build Flink against
> that specific Hadoop version yourself.
>
> On Mon, Feb 8, 2016 at 5:50 PM, Pieter Hameete <phame...@gmail.com> wrote:
>
>> After downloading and building the 1.0-SNAPSHOT from the master branch I
>> do run into another problem when starting a YARN cluster. The startup now
>> loops indefinitely at the following step:
>>
>> 17:39:12,369 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
>> 17:39:34,855 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
>>
>> Any clue what could've gone wrong? I used all defaults when building with
>> Maven.
>>
>> - Pieter
>>
>> 2016-02-08 17:07 GMT+01:00 Pieter Hameete <phame...@gmail.com>:
>>
>>> Matter of RTFM eh ;-) thx and sorry for the bother.
>>>
>>> 2016-02-08 17:06 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>>>
>>>> You said earlier that you are using Flink 0.10. The feature is only
>>>> available in 1.0-SNAPSHOT.
>>>>
>>>> On Mon, Feb 8, 2016 at 4:53 PM, Pieter Hameete <phame...@gmail.com> wrote:
>>>>
>>>>> I've tried setting the yarn.application-master.port property in
>>>>> flink-conf.yaml to a range, as suggested in
>>>>> https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html#running-flink-on-yarn-behind-firewalls
>>>>>
>>>>> The JobManager does not seem to be picking the property up. Am I
>>>>> setting this in the wrong place? Or is there another way to enforce this
>>>>> property?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Pieter
>>>>>
>>>>> 2016-02-07 20:04 GMT+01:00 Pieter Hameete <phame...@gmail.com>:
>>>>>
>>>>>> I found the relevant information on the website.
>>>>>> I'll consult with the cluster admin tomorrow, thanks for the help :-)
>>>>>>
>>>>>> - Pieter
>>>>>>
>>>>>> 2016-02-07 19:31 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> we had other users with a similar issue as well. There is a
>>>>>>> configuration value which allows you to specify a single port or a
>>>>>>> range of ports for the JobManager to allocate when running on YARN.
>>>>>>> Note that when using this with a single port, the JMs may collide.
>>>>>>>
>>>>>>> On Sun, Feb 7, 2016 at 7:25 PM, Pieter Hameete <phame...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Stephan,
>>>>>>>>
>>>>>>>> surely it seems this way! I must not be the first with this issue
>>>>>>>> though? I'll have to contact the cluster admins to find a solution
>>>>>>>> together. What would be a way of making the JobManagers accessible
>>>>>>>> from outside the network, given that the IP and port number change
>>>>>>>> every time?
>>>>>>>>
>>>>>>>> Alternatively, I can ask for ssh access to a node within the
>>>>>>>> network. That will surely work, but it's not my preferred solution.
>>>>>>>>
>>>>>>>> - Pieter
>>>>>>>>
>>>>>>>> 2016-02-06 16:22 GMT+01:00 Stephan Ewen <se...@apache.org>:
>>>>>>>>
>>>>>>>>> Yeah, sounds a lot like the client cannot connect to the
>>>>>>>>> JobManager port.
>>>>>>>>>
>>>>>>>>> The ports to communicate with HDFS and the YARN resource manager
>>>>>>>>> may be whitelisted or forwarded, so you can submit the YARN session,
>>>>>>>>> but then not connect to the JobManager afterwards.
>>>>>>>>>
>>>>>>>>> On Sat, Feb 6, 2016 at 2:11 PM, Pieter Hameete <phame...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Max!
>>>>>>>>>>
>>>>>>>>>> I'm using Flink 0.10.1 and indeed the cluster seems to be created
>>>>>>>>>> fine, all in the JobManager Web UI looks good.
>>>>>>>>>>
>>>>>>>>>> It seems like the JobManager initiates the connection with my VM
>>>>>>>>>> and cannot reach it.
>>>>>>>>>> It could be that this is similar to the problem here:
>>>>>>>>>>
>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-with-docker-errors-with-akka-NAT-td7702.html
>>>>>>>>>>
>>>>>>>>>> I probably have to make some changes to the networking
>>>>>>>>>> configuration of my VM so it can be reached by the JobManager
>>>>>>>>>> despite using a different port each time.
>>>>>>>>>>
>>>>>>>>>> - Pieter
>>>>>>>>>>
>>>>>>>>>> 2016-02-06 14:05 GMT+01:00 Maximilian Michels <m...@apache.org>:
>>>>>>>>>>
>>>>>>>>>>> Hi Pieter,
>>>>>>>>>>>
>>>>>>>>>>> Which version of Flink are you using? It appears you've created a
>>>>>>>>>>> Flink YARN cluster but you can't reach the JobManager afterwards.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Max
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 6, 2016 at 1:42 PM, Pieter Hameete <phame...@gmail.com> wrote:
>>>>>>>>>>> > Hi Robert,
>>>>>>>>>>> >
>>>>>>>>>>> > unfortunately there are no signs of what is going wrong in the
>>>>>>>>>>> > logs. The last log messages are about successful registration of
>>>>>>>>>>> > the TaskManagers.
>>>>>>>>>>> >
>>>>>>>>>>> > I'm also fairly sure it must be something in my VM that is
>>>>>>>>>>> > causing this, because when I start the yarn-session from a login
>>>>>>>>>>> > node that is on the same network as the Hadoop cluster there are
>>>>>>>>>>> > no problems registering with the JobManager. I did also notice
>>>>>>>>>>> > the following message in the local console:
>>>>>>>>>>> >
>>>>>>>>>>> > 12:30:27,173 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@145.100.41.13:41539]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: connection timed out: /145.100.41.13:41539
>>>>>>>>>>> >
>>>>>>>>>>> > I can ping the JobManager fine from within the VM.
>>>>>>>>>>> > Could there be some invalid or missing configuration on my side?
>>>>>>>>>>> >
>>>>>>>>>>> > Cheers,
>>>>>>>>>>> >
>>>>>>>>>>> > Pieter
>>>>>>>>>>> >
>>>>>>>>>>> > 2016-02-06 12:54 GMT+01:00 Robert Metzger <rmetz...@apache.org>:
>>>>>>>>>>> >>
>>>>>>>>>>> >> Hi,
>>>>>>>>>>> >>
>>>>>>>>>>> >> did you check the logs of the JobManager itself? Maybe it'll
>>>>>>>>>>> >> tell us already what's going on.
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Sat, Feb 6, 2016 at 12:14 PM, Pieter Hameete <phame...@gmail.com> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Hi Guys!
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I'm attempting to run Flink on YARN, but I run into an issue.
>>>>>>>>>>> >>> I'm starting the Flink YARN session from an Ubuntu 14.04 VM. All
>>>>>>>>>>> >>> goes well until after the JobManager web UI is started:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> JobManager web interface address http://head05.hathi.surfsara.nl:8088/proxy/application_1452780322684_10532/
>>>>>>>>>>> >>> Waiting until all TaskManagers have connected
>>>>>>>>>>> >>> 11:09:51,557 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>>>>>>>>>> >>> No status updates from the YARN cluster received so far. Waiting ...
>>>>>>>>>>> >>> 11:09:51,578 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>>>>>>>>>> >>> 11:09:51,583 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
>>>>>>>>>>> >>> 11:09:51,595 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@145.100.41.148:35666/user/jobmanager.
>>>>>>>>>>> >>> No status updates from the YARN cluster received so far. Waiting ...
>>>>>>>>>>> >>> No status updates from the YARN cluster received so far. Waiting ...
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> It then hangs on these last steps (trying to register, no
>>>>>>>>>>> >>> status updates...).
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I'm sure there must be a problem on my side that is preventing
>>>>>>>>>>> >>> me from registering at the JobManager. What could cause such
>>>>>>>>>>> >>> connection problems?
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Any tips are very welcome :-)
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Cheers and have a good weekend!
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> - Pieter
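
A note on diagnosing the symptom in this thread: ping succeeding while the akka.tcp connection times out suggests that the JobManager's TCP port is blocked, rather than the host being down. Below is a minimal Python sketch for checking this from the client machine. The helper names (expand_ports, can_connect, reachable_ports) are hypothetical, and the port spec syntax (comma-separated single ports and low-high ranges) is an assumption based on the "range" mentioned above for yarn.application-master.port; check the linked YARN setup page for the exact format your Flink version accepts.

```python
import socket


def expand_ports(spec):
    """Expand a port spec such as "50100-50102,50200" into a sorted list of ints.

    The spec syntax is an assumption for illustration, not taken from Flink.
    """
    ports = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(p) for p in part.split("-", 1))
            if low > high:
                raise ValueError(f"invalid range: {part}")
            ports.update(range(low, high + 1))
        else:
            ports.add(int(part))
    return sorted(ports)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def reachable_ports(host, spec, timeout=3.0):
    """Return the ports from spec on which host accepts TCP connections."""
    return [p for p in expand_ports(spec) if can_connect(host, p, timeout)]
```

For example, `reachable_ports("145.100.41.148", "50100-50200")` would list which ports of a placeholder 50100-50200 range accept connections from your machine; that range is illustrative, not a value from this thread. A port only shows up if a process is listening on it, so run the check while the YARN session is up.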