Hello Fabian,

Thank you very much for the resource. I had already gone through it and found that port 6123 is the default for TaskManager registration. What I want to know is the specific range of ports the TaskManager accesses during job execution.

The TaskManager always tries to access a random port during job execution, so I have to open it in the firewall using 'ufw allow port' during the execution; otherwise the job hangs and finally fails. Is there a particular range of ports that I can specify in iptables so that access is always allowed?

Kind Regards,
Ravinder Kaur

On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <[email protected]> wrote:

> Hi Ravinder,
>
> please have a look at the configuration documentation:
>
> -->
> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager
>
> Best, Fabian
>
> 2016-02-10 13:55 GMT+01:00 Ravinder Kaur <[email protected]>:
>
>> Hello All,
>>
>> I need to know the range of ports that are being used during the
>> master/slave communication in the Flink cluster. Also, is there a way I can
>> specify a range of ports, at the slaves, to restrict them to connect to
>> master only in this range?
>>
>> Kind Regards,
>> Ravinder Kaur
>>
>>
>> On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[email protected]> wrote:
>>
>>> Can machines connect to port 6123? The firewall may block that port, but
>>> permit SSH.
>>>
>>> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Here is the log file of the JobManager. I did not see anything suspicious
>>>> and, as it suggests, the ports are also listening.
>>>>
>>>> 20:58:46,906 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Starting JobManager on IP-of-master:6123 with execution mode
>>>> CLUSTER and streaming mode BATCH_ONLY
>>>> 20:58:46,978 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Security is not enabled. Starting non-authenticated JobManager.
>>>> 20:58:46,979 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Starting JobManager
>>>> 20:58:46,980 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Starting JobManager actor system at 10.155.208.138:6123
>>>> 20:58:48,196 INFO akka.event.slf4j.Slf4jLogger
>>>> - Slf4jLogger started
>>>> 20:58:48,295 INFO Remoting
>>>> - Starting remoting
>>>> 20:58:48,541 INFO Remoting
>>>> - Remoting started; listening on addresses
>>>> :[akka.tcp://flink@IP-of-master:6123]
>>>> 20:58:48,549 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Starting JobManger web frontend
>>>> 20:58:48,690 INFO
>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using
>>>> directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web
>>>> interface files
>>>> 20:58:48,691 INFO
>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving
>>>> job manager log from
>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
>>>> 20:58:48,691 INFO
>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving
>>>> job manager stdout from
>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
>>>> 20:58:49,044 INFO
>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web
>>>> frontend listening at 0:0:0:0:0:0:0:0:8081
>>>> 20:58:49,045 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Starting JobManager actor
>>>> 20:58:49,052 INFO org.apache.flink.runtime.blob.BlobServer
>>>> - Created BLOB server storage directory
>>>> /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
>>>> 20:58:49,054 INFO org.apache.flink.runtime.blob.BlobServer
>>>> - Started BLOB server at 0.0.0.0:43683 - max concurrent
>>>> requests: 50 - max backlog: 1000
>>>> 20:58:49,075 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist
>>>> - Started memory archivist akka://flink/user/archive
>>>> 20:58:49,075 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Starting JobManager at
>>>> akka.tcp://flink@IP-of-master:6123/user/jobmanager.
>>>> 20:58:49,081 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager
>>>> was granted leadership with leader session ID None.
>>>> 20:58:49,082 INFO
>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting
>>>> with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on
>>>> port 8081
>>>> 20:58:49,083 INFO
>>>> org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader
>>>> reachable under
>>>> akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
>>>> 20:59:22,794 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016).
>>>> 20:59:22,853 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016).
>>>> 20:59:22,857 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
>>>> 20:59:22,859 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN
>>>> DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
>>>> 20:59:22,881 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN
>>>> DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to
>>>> CANCELED
>>>> 20:59:22,881 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>> Not enough free slots available to run the job. You can decrease the
>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>> Number of instances=0, total number of slots=0, available slots=0
>>>> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>> at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>> at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>> at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>> at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>> at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>> at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> 20:59:22,886 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN
>>>> Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1)
>>>> (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
>>>> 20:59:22,887 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSink
>>>> (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from
>>>> CREATED to CANCELED
>>>> 20:59:22,890 INFO org.apache.flink.runtime.jobmanager.JobManager
>>>> - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>> Not enough free slots available to run the job. You can decrease the
>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>> Number of instances=0, total number of slots=0, available slots=0
>>>> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>> at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>> at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>> at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>> at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>> at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>> at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>>
>>>>
>>>> On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> the TaskManager is starting up, but it's not able to register at the
>>>>> job manager. Did you check the JobManager log? Do you see anything
>>>>> suspicious there? Are the ports matching?
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Thank you for pointing it out. I had a little typo while I edited the
>>>>>> hostname in flink-conf.yaml. I've reset it and the TaskManager started
>>>>>> up.
>>>>>> But I still can't run the WordCount example and it throws the same
>>>>>> NoResourceAvailableException.
>>>>>>
>>>>>> Caused by:
>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>> Not enough free slots available to run the job. You can decrease the
>>>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>>>> Number of instances=0, total number of slots=0, available slots=0
>>>>>> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>>> at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>>> at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>>> at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>>> at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>>> at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>> ... 8 more
>>>>>>
>>>>>> The log of the TaskManager again has the same errors as before.
>>>>>>
>>>>>> 20:58:58,457 INFO org.apache.flink.runtime.net.ConnectionUtils
>>>>>> - Failed to connect from address '/slave-IP': connect timed out
>>>>>> 20:58:58,458 INFO org.apache.flink.runtime.net.ConnectionUtils
>>>>>> - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network
>>>>>> is unreachable
>>>>>> 20:58:58,458 INFO org.apache.flink.runtime.net.ConnectionUtils
>>>>>> - Failed to connect from address '/127.0.0.1': Invalid argument
>>>>>> 20:58:59,048 WARN org.apache.flink.runtime.net.ConnectionUtils
>>>>>> - Could not connect to /master-IP:6123. Selecting a local
>>>>>> address using heuristics.
>>>>>> 20:58:59,050 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - TaskManager will use hostname/address 'hostname-of-slave'
>>>>>> (slave-IP) for communication.
>>>>>> 20:58:59,051 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Starting TaskManager in streaming mode BATCH_ONLY
>>>>>> 20:58:59,052 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Starting TaskManager actor system at slave_IP:0
>>>>>> 20:58:59,776 INFO akka.event.slf4j.Slf4jLogger
>>>>>> - Slf4jLogger started
>>>>>> 20:58:59,842 INFO Remoting
>>>>>> - Starting remoting
>>>>>> 20:59:00,094 INFO Remoting
>>>>>> - Remoting started; listening on addresses
>>>>>> :[akka.tcp://flink@slave-IP:33813]
>>>>>> 20:59:00,100 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Starting TaskManager actor
>>>>>> 20:59:00,125 INFO
>>>>>> org.apache.flink.runtime.io.network.netty.NettyConfig -
>>>>>> NettyConfig [server address: hostname-of-master/master-IP, server port:
>>>>>> 49030, memory segment size (bytes): 32768, transport type: NIO, number of
>>>>>> server threads: 0 (use Netty's default), number of client threads: 0 (use
>>>>>> Netty's default), server connect backlog: 0 (use Netty's default), client
>>>>>> connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use
>>>>>> Netty's default)]
>>>>>> 20:59:00,131 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Messages between TaskManager and JobManager have a max timeout
>>>>>> of 100000 milliseconds
>>>>>> 20:59:00,142 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Temporary file directory '/tmp': total 4 GB, usable 1 GB
>>>>>> (25.00% usable)
>>>>>> 20:59:00,210 INFO
>>>>>> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool -
>>>>>> Allocated 64 MB for network buffer pool (number of memory segments:
>>>>>> 2048, bytes per segment: 32768).
>>>>>> 20:59:00,323 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Using 0.7 of the currently free heap space for Flink managed
>>>>>> heap memory (293 MB).
>>>>>> 20:59:00,565 INFO
>>>>>> org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O
>>>>>> manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8
>>>>>> for spill files.
>>>>>> 20:59:00,578 INFO org.apache.flink.runtime.filecache.FileCache
>>>>>> - User file cache uses directory
>>>>>> /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
>>>>>> 20:59:00,908 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Starting TaskManager actor at
>>>>>> akka://flink/user/taskmanager#-157676733.
>>>>>> 20:59:00,908 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - TaskManager data connection information: hostname-of-master
>>>>>> (dataPort=49030)
>>>>>> 20:59:00,909 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - TaskManager has 1 task slot(s).
>>>>>> 20:59:00,910 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304
>>>>>> MB (used/committed/max)]
>>>>>> 20:59:00,917 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Trying to register at JobManager
>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 1, timeout: 500 milliseconds)
>>>>>> 20:59:01,443 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Trying to register at JobManager
>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 2, timeout: 1000 milliseconds)
>>>>>> 20:59:02,873 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Trying to register at JobManager
>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 3, timeout: 2000 milliseconds)
>>>>>> 20:59:04,893 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Trying to register at JobManager
>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 4, timeout: 4000 milliseconds)
>>>>>> 20:59:08,914 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>> - Trying to register at JobManager
>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 5, timeout: 8000 milliseconds)
>>>>>>
>>>>>>
>>>>>> Kind Regards,
>>>>>> Ravinder Kaur
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> This looks like the reason:
>>>>>>>
>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager
>>>>>>> hostname 'hostname-of-master' specified in the configuration
>>>>>>>
>>>>>>> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The log file of the Taskmanager now shows the following
>>>>>>>>
>>>>>>>> 18:27:10,082 WARN org.apache.hadoop.util.NativeCodeLoader
>>>>>>>> - Unable to load native-hadoop library for your
>>>>>>>> platform... using builtin-java classes where applicable
>>>>>>>> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> -
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Starting TaskManager (Version: 0.10.1, Rev:2e9b231,
>>>>>>>> Date:22.11.2015 @ 12:41:12 CET)
>>>>>>>> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Current user: flink
>>>>>>>> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation -
>>>>>>>> 1.7/24.91-b01
>>>>>>>> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Maximum heap size: 491 MiBytes
>>>>>>>> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Hadoop version: 2.7.0
>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - JVM Options:
>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - -Xms512M
>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - -Xmx512M
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - -XX:MaxDirectMemorySize=8388607T
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - -XX:MaxPermSize=256m
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> -
>>>>>>>> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> -
>>>>>>>> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> -
>>>>>>>> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Program Arguments:
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - --configDir
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - /home/flink/flink-0.10.1/conf
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - --streamingMode
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - batch
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Classpath:
>>>>>>>> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> -
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> 18:27:10,252 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Maximum number of open file descriptors is 4096
>>>>>>>> 18:27:10,277 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Loading configuration from /home/flink/flink-0.10.1/conf
>>>>>>>> 18:27:10,356 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Security is not enabled. Starting non-authenticated
>>>>>>>> TaskManager.
>>>>>>>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>> - Failed to run TaskManager.
>>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager
>>>>>>>> hostname 'hostname-of-master' specified in the configuration
>>>>>>>> at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>>>>>>> at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>>>>>>> at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>>>>>>> at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>>>>>>> at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>>>>>>> at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>>>>>>> at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>> Ravinder Kaur
>>>>>>>>
>>>>>>>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What do the TaskManager logs say?
>>>>>>>>>
>>>>>>>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Thanks for the quick reply. I tried to set jobmanager.rpc.address
>>>>>>>>>> in flink-conf.yaml to the hostname of the master node on both nodes.
>>>>>>>>>>
>>>>>>>>>> Now it does not start the TaskManager at the worker node at all.
>>>>>>>>>> When I start the cluster using ./bin/start-cluster.sh on the master, it
>>>>>>>>>> shows the normal output of starting the JobManager and TaskManager, but
>>>>>>>>>> when I run jps on the nodes, the slave does not have the TaskManager
>>>>>>>>>> running.
>>>>>>>>>>
>>>>>>>>>> Running the WordCount example again fails showing the same error.
>>>>>>>>>> Stopping the cluster says no taskmanager to stop.
>>>>>>>>>>
>>>>>>>>>> Kind Regards,
>>>>>>>>>> Ravinder Kaur
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looks like the network configuration is not correct.
>>>>>>>>>>>
>>>>>>>>>>> I would try setting the full host name (like "master.abc.xyz.com")
>>>>>>>>>>> as jobmanager.rpc.address.
>>>>>>>>>>>
>>>>>>>>>>> Greetings,
>>>>>>>>>>> Stephan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Community,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm a student and new to Apache Flink. I'm trying to learn and
>>>>>>>>>>>> have set up a 2-node standalone Flink (0.10.1) cluster (one master
>>>>>>>>>>>> and one worker). I'm facing the following issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Cluster: consists of 2 VMs (one master and one worker)
>>>>>>>>>>>>
>>>>>>>>>>>> The configurations are done as per
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>>>>>>>
>>>>>>>>>>>> When I start the cluster both the JobManager and the
>>>>>>>>>>>> TaskManager are started on the master and worker respectively.
>>>>>>>>>>>>
>>>>>>>>>>>> Command to start the cluster: bin/start-cluster.sh
>>>>>>>>>>>>
>>>>>>>>>>>> jps shows all the processes running.
>>>>>>>>>>>>
>>>>>>>>>>>> Then I run the following command to run a WordCount example
>>>>>>>>>>>> job: ./bin/flink run ./examples/WordCount.jar
>>>>>>>>>>>>
>>>>>>>>>>>> The result is attached to this mail.
>>>>>>>>>>>>
>>>>>>>>>>>> The error is
>>>>>>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>>>>>>>> Not enough free slots available to run the job
>>>>>>>>>>>> ....................... Resources available to scheduler: Number of
>>>>>>>>>>>> instances=0, total number of slots=0, available slots=0
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore I supposed that the JobManager does not find the
>>>>>>>>>>>> TaskManager and checked the logs of the TaskManager, which indeed
>>>>>>>>>>>> show that the TaskManager is unable to register at the JobManager
>>>>>>>>>>>> for quite a long time. There are
>>>>>>>>>>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>>>>>>>>>>> localhost: Connect timed out and
>>>>>>>>>>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>>>>>>>>>>> address localhost: Network is Unreachable messages in the log of the
>>>>>>>>>>>> TaskManager. Later, after a number of attempts, it starts up and
>>>>>>>>>>>> tries to register at the JobManager, which also fails after a lot of
>>>>>>>>>>>> attempts, showing the following messages:
>>>>>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager: Trying to register
>>>>>>>>>>>> at JobManager akka.tcp://flink@master:6123/user/jobmanager
>>>>>>>>>>>> (attempt:92, timeout:30seconds) and
>>>>>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate
>>>>>>>>>>>> with unreachable remote host
>>>>>>>>>>>> [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated
>>>>>>>>>>>> for 5000ms, all messages to this address will be delivered to dead
>>>>>>>>>>>> letters. Reason: Connection timed out: /master:6123
>>>>>>>>>>>>
>>>>>>>>>>>> I browsed the internet for these and found
>>>>>>>>>>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>>>>>>>>>>> and https://issues.apache.org/jira/browse/FLINK-1119 helpful.
>>>>>>>>>>>> Stephan Ewen, who provided the solution in both links, gives a good
>>>>>>>>>>>> explanation that the TaskManagers take quite some time to register
>>>>>>>>>>>> at the JobManager, and therefore I waited for as long as 20 mins
>>>>>>>>>>>> after starting the cluster to run the job. But even after waiting so
>>>>>>>>>>>> long I get the same error.
>>>>>>>>>>>>
>>>>>>>>>>>> Another suggestion was to run the cluster in streaming mode. So
>>>>>>>>>>>> I tried it with the command bin/start-cluster-streaming.sh and
>>>>>>>>>>>> ran the job, but I get the same error.
>>>>>>>>>>>>
>>>>>>>>>>>> I re-checked all the configurations but could not find anything
>>>>>>>>>>>> wrong. I also checked port 6123 on the master, which is in LISTEN
>>>>>>>>>>>> state, and the TCP request from worker to master shows SYN_SENT
>>>>>>>>>>>> state, using the netstat -na and lsof -i commands.
>>>>>>>>>>>>
>>>>>>>>>>>> I opened the webpage on master http://localhost:8081 but it
>>>>>>>>>>>> shows nothing and localhost:8080 says connection refused.
>>>>>>>>>>>>
>>>>>>>>>>>> Kindly help me out as it is very important for me. Let me know
>>>>>>>>>>>> if you have any questions.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>> Ravinder Kaur
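
A note on the port question at the top of this thread, as a sketch rather than a confirmed answer. The quoted logs show three ports that were picked at random in that run: the TaskManager actor system ("actor system at slave_IP:0", which became 33813), the TaskManager data port (dataPort=49030), and the BLOB server on the JobManager (0.0.0.0:43683). As far as I know, the configuration page Fabian linked lists keys for fixing these in flink-conf.yaml. The key names below should be checked against that page for 0.10.1 (blob.server.port in particular is an assumption for this version), and the port numbers other than 6123 are arbitrary examples.

```yaml
# flink-conf.yaml (sketch; all values except 6123 are illustrative assumptions)

# JobManager RPC port. 6123 is the default and is the port seen in the logs.
jobmanager.rpc.port: 6123

# TaskManager actor system port. A value of 0 means "pick a free port",
# which matches the "actor system at slave_IP:0" line in the TaskManager log.
taskmanager.rpc.port: 6122

# TaskManager data exchange port (reported as dataPort=49030 in that run).
taskmanager.data.port: 6121

# BLOB server port on the JobManager (0.0.0.0:43683 in the JobManager log).
# Assumption: verify that this key is supported by your Flink version.
blob.server.port: 6124
```

With the ports pinned like this on both nodes, the firewall rules can name them explicitly (for example 'ufw allow 6121/tcp') instead of opening a new port by hand for every job run.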
