I checked, and the JVMs didn't crash. There is no Puppet or any similar service controlling processes either. One thing I found is that things work OK with a smaller number of slaves. For example, here I was trying to run on 16 nodes with 2 TMs each. When I reduced it to 4 nodes, each with 2 TMs, it worked.
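For reference, here is a minimal sketch of the setup discussed in this thread, where each host is listed N times in the slaves file to get N TaskManagers per host. The host names and both helper functions are hypothetical and only for illustration; the figure of 12 slots per TaskManager is taken from the JobManager log quoted below.

```python
def build_slaves_file(hosts, tms_per_host):
    """Return slaves-file text with every host repeated tms_per_host times."""
    if tms_per_host < 1:
        raise ValueError("tms_per_host must be >= 1")
    # One line per TaskManager; repeating a host asks for another TM on it.
    lines = [host for host in hosts for _ in range(tms_per_host)]
    return "\n".join(lines) + "\n"


def expected_slots(hosts, tms_per_host, slots_per_tm):
    """Total task slots the JobManager should report once all TMs register."""
    return len(hosts) * tms_per_host * slots_per_tm


# Hypothetical host names j-001 .. j-004 (the 4-node case that worked).
hosts = [f"j-{i:03d}" for i in range(1, 5)]

print(build_slaves_file(hosts, 2), end="")
print(expected_slots(hosts, 2, 12))  # 4 hosts * 2 TMs * 12 slots = 96
```

With 16 hosts the same calls would give 32 entries and 384 expected slots, which can be compared against the "Current number of alive task slots" value in the JM log to see how many TMs actually registered.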
On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
> from the TaskManager logs, I cannot see anything suspicious.
> It's a bit weird that the TaskManager logs just end, without any shutdown
> messages. Usually the TMs log some shutdown messages when they are stopping.
> Also, if they were still running, I would expect some error messages
> from akka about the connection status.
> So the only thing I can conclude is that one of the TMs was killed by the OS
> or the JVM crashed. Did you check if that happened?
>
> Do you have any service like puppet that is controlling processes?
>
>
> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> I see two logs (attached), but there's only 1 TaskManager process. Also,
>> the Web console says it can find only 1 TM.
>>
>> However, I see this part in the JM log, which shows there was a second TM at
>> one point, but it was unregistered. Any thoughts?
>>
>> --------------------------
>>
>> - Registered TaskManager at j-002
>> (akka.tcp://flink@172.16.0.2:42888/user/taskmanager) as
>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>> Current number of alive task slots is 12.
>>
>> 2016-07-07 11:32:40,363 WARN akka.remote.ReliableDeliverySupervisor -
>> Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
>> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>
>> 2016-07-07 11:32:42,722 INFO
>> org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>> Current number of alive task slots is 24.
>>
>> 2016-07-07 11:33:15,316 WARN Remoting - Tried to associate with
>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. Address
>> is now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters.
>> Reason: Connection refused: /172.16.0.2:42888
>>
>> 2016-07-07 11:33:15,320 INFO
>> org.apache.flink.runtime.jobmanager.JobManager - Task manager
>> akka.tcp://flink@172.16.0.2:42888/user/taskmanager terminated.
>> 2016-07-07 11:33:15,320 INFO
>> org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
>> registered task managers 1. Number of available slots 12.
>>
>>
>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <u...@apache.org> wrote:
>>
>>> No, that should suffice. Can you check whether there are any task
>>> manager logs for the second TM on that machine
>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>> manager process does start up and there is another problem. If not,
>>> the task managers do not even seem to start.
>>>
>>> – Ufuk
>>>
>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>> > I tried to run more than one task manager per node by duplicating the
>>> > slave IPs. At startup it says, for example,
>>> >
>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>> > Starting taskmanager daemon on host j-011.
>>> >
>>> > but I only see 1 task manager process running.
>>> >
>>> > Is there anything else I need to do?
>>> >
>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >>
>>> >> Yes, exactly.
>>> >>
>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esal...@gmail.com>
>>> >> wrote:
>>> >> > Thank you, yes, it can be done externally, if not supported within
>>> >> > Flink.
>>> >> >
>>> >> > So the way to spawn multiple task managers would be to list the same
>>> >> > slave machines N times as necessary in the slaves file?
>>> >> >
>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >> >>
>>> >> >> No, not inside of Flink. That sounds like something the OS or
>>> >> >> resource manager should handle.
>>> >> >>
>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esal...@gmail.com>
>>> >> >> wrote:
>>> >> >> > That's great, so is there support to pin task managers to sockets
>>> >> >> > as well?
>>> >> >> >
>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >> >> >>
>>> >> >> >> Regarding 2) if you don't manually configure something else, that
>>> >> >> >> should happen always.
>>> >> >> >>
>>> >> >> >> Yes, you can run more than one task manager per node depending on
>>> >> >> >> the process isolation you want. Within a task manager, there are
>>> >> >> >> multiple threads for each slot. For example, if you have 2 task
>>> >> >> >> managers with 2 slots each and submit a job with parallelism 4,
>>> >> >> >> each task manager will execute 2 subtasks in separate threads.
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esal...@gmail.com>
>>> >> >> >> wrote:
>>> >> >> >> > Hi Ufuk,
>>> >> >> >> >
>>> >> >> >> > Looking at the document you sent, it seems only 1 task manager
>>> >> >> >> > per node exists, and within that you have multiple slots. Is it
>>> >> >> >> > possible to run more than 1 task manager per node? Also, within
>>> >> >> >> > a task manager, is the parallelism done through threads or
>>> >> >> >> > processes?
>>> >> >> >> >
>>> >> >> >> > Thank you,
>>> >> >> >> > Saliya
>>> >> >> >> >
>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>> >> >> >> > <esal...@gmail.com>
>>> >> >> >> > wrote:
>>> >> >> >> >>
>>> >> >> >> >> Thank you, I'll check these.
>>> >> >> >> >>
>>> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
>>> >> >> >> >> there a case where they wouldn't?
>>> >> >> >> >>
>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <u...@apache.org>
>>> >> >> >> >> wrote:
>>> >> >> >> >>>
>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>> >> >> >> >>> <esal...@gmail.com>
>>> >> >> >> >>> wrote:
>>> >> >> >> >>> > 1. What parameters are available to control parallelism
>>> >> >> >> >>> > within a node?
>>> >> >> >> >>>
>>> >> >> >> >>> Task Manager processing slots:
>>> >> >> >> >>>
>>> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>> >> >> >> >>>
>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
>>> >> >> >> >>> > node (without doing TCP calls)?
>>> >> >> >> >>>
>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example
>>> >> >> >> >>> if you have a map-reduce, map subtask 1 and reduce subtask 1 are
>>> >> >> >> >>> likely to exchange data locally.
>>> >> >> >> >>>
>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>> >> >> >> >>>
>>> >> >> >> >>> No, not that I'm aware of.
>>> >> >> >> >>>
>>> >> >> >> >>> – Ufuk
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Saliya Ekanayake
>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>> >> >> >> >> School of Informatics and Computing | Digital Science Center
>>> >> >> >> >> Indiana University, Bloomington

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington