Hi, can you post your configuration parameters (excluding default settings) and a description of your cluster?
Best,
Ovidiu

> On 11 Jul 2016, at 17:49, Saliya Ekanayake <esal...@gmail.com> wrote:
>
> Thank you Greg, I'll check if this was the cause for my TMs to disappear.
>
> On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <c...@greghogan.com> wrote:
> The OOM killer doesn't give a warning, so you'll need to call dmesg or look in
> /var/log/messages or similar. The following link reports that Debian flavors
> may use /var/log/syslog.
>
> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>
> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
> Greg,
>
> where did you see the OOM log as shown in this mail thread? In my case neither
> the TaskManagers nor the JobManager report an error like this.
>
> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <c...@greghogan.com> wrote:
> These symptoms sound similar to what I was experiencing in the following
> thread. Flink can have some unexpected memory usage which can result in an
> OOM kill by the kernel, and this becomes more pronounced as the cluster size
> grows.
> https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>
> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
> I checked, but the JVMs didn't crash. No puppet or other services like that.
>
> One thing I found is that things work OK when I have a smaller number of
> slaves. For example, here I was trying to run on 16 nodes giving 2 TMs each.
> Then I reduced it to 4 nodes each with 2 TMs, which worked.
>
> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetz...@apache.org> wrote:
> Hi,
> from the TaskManager logs, I cannot see anything suspicious.
> It's a bit weird that the TaskManager logs just end, without any shutdown
> messages. Usually the TMs log some shutdown messages when they are stopping.
> Also, if they were still running, I would expect some error messages from
> akka about the connection status.
> So the only thing I can conclude is that one of the TMs was killed by the OS
> or the JVM crashed. Did you check if that happened?
>
> Do you have any service like puppet that is controlling processes?
>
> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
> I see two logs (attached), but there's only 1 TaskManager process. Also, the
> Web console says it can find only 1 TM.
>
> However, I see this part in the JM log, which shows there was a second TM at
> one point, but it was unregistered. Any thoughts?
>
> --------------------------
>
> - Registered TaskManager at j-002
> (akka.tcp://flink@172.16.0.2:42888/user/taskmanager) as
> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
> Current number of alive task slots is 12.
>
> 2016-07-07 11:32:40,363 WARN akka.remote.ReliableDeliverySupervisor -
> Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>
> 2016-07-07 11:32:42,722 INFO
> org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at
> j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
> Current number of alive task slots is 24.
>
> 2016-07-07 11:33:15,316 WARN Remoting - Tried to associate with unreachable
> remote address [akka.tcp://flink@172.16.0.2:42888].
> Address is now gated for 5000 ms, all
> messages to this address will be delivered to dead letters. Reason:
> Connection refused: /172.16.0.2:42888
>
> 2016-07-07 11:33:15,320 INFO org.apache.flink.runtime.jobmanager.JobManager
> - Task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager terminated.
> 2016-07-07 11:33:15,320 INFO
> org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager
> akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of registered task
> managers 1. Number of available slots 12.
>
> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <u...@apache.org> wrote:
> No, that should suffice. Can you check whether there are any task
> manager logs for the second TM on that machine
> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
> manager process does start up and there is another problem. If not,
> the task managers seem not to start at all.
>
> – Ufuk
>
> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
> > I tried to run more than one task manager per node by duplicating the slave
> > IPs. At startup it says for example,
> >
> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
> > Starting taskmanager daemon on host j-011.
> >
> > but I only see 1 task manager process running.
> >
> > Is there anything else I need to do?
> >
> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <u...@apache.org> wrote:
> >>
> >> Yes, exactly.
> >>
> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esal...@gmail.com>
> >> wrote:
> >> > Thank you, yes, it can be done externally, if not supported within
> >> > Flink.
> >> >
> >> > So the way to spawn multiple task managers would be to list the same
> >> > slave machines N times as necessary in the slaves file?
> >> >
> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <u...@apache.org> wrote:
> >> >>
> >> >> No, not inside of Flink. That sounds like something the OS or
> >> >> resource manager should handle.
> >> >>
> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esal...@gmail.com>
> >> >> wrote:
> >> >> > That's great, so is there support to pin task managers to sockets as
> >> >> > well?
> >> >> >
> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <u...@apache.org> wrote:
> >> >> >>
> >> >> >> Regarding 2) if you don't manually configure something else, that
> >> >> >> should happen always.
> >> >> >>
> >> >> >> Yes, you can run more than one task manager per node depending on
> >> >> >> the process isolation you want. Within a task manager, there are
> >> >> >> multiple threads for each slot. For example, if you have 2 task
> >> >> >> managers with 2 slots each and submit a job with parallelism 4, each
> >> >> >> task manager will execute 2 subtasks in separate threads.
> >> >> >>
> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esal...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi Ufuk,
> >> >> >> >
> >> >> >> > Looking at the document you sent, it seems only 1 task manager per
> >> >> >> > node exists and within that you have multiple slots. Is it possible
> >> >> >> > to run more than 1 task manager per node? Also, within a task
> >> >> >> > manager is the parallelism done through threads or processes?
> >> >> >> >
> >> >> >> > Thank you,
> >> >> >> > Saliya
> >> >> >> >
> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
> >> >> >> > <esal...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> Thank you, I'll check these.
> >> >> >> >>
> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
> >> >> >> >> there a case where they wouldn't?
> >> >> >> >>
> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <u...@apache.org>
> >> >> >> >> wrote:
> >> >> >> >>>
> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
> >> >> >> >>> <esal...@gmail.com> wrote:
> >> >> >> >>> > 1. What parameters are available to control parallelism within
> >> >> >> >>> > a node?
> >> >> >> >>>
> >> >> >> >>> Task Manager processing slots:
> >> >> >> >>>
> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> >> >> >> >>>
> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
> >> >> >> >>> > node (without doing TCP calls)?
> >> >> >> >>>
> >> >> >> >>> Yes, local exchanges happen via memory and not TCP. For example,
> >> >> >> >>> if you have a map-reduce, map subtask 1 and reduce subtask 1 are
> >> >> >> >>> likely to exchange data locally.
> >> >> >> >>>
> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
> >> >> >> >>>
> >> >> >> >>> No, not that I'm aware of.
> >> >> >> >>>
> >> >> >> >>> – Ufuk
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Saliya Ekanayake
> >> >> >> >> Ph.D. Candidate | Research Assistant
> >> >> >> >> School of Informatics and Computing | Digital Science Center
> >> >> >> >> Indiana University, Bloomington
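
For the OOM-killer check Greg suggests earlier in the thread (dmesg, /var/log/messages, or /var/log/syslog), here is a minimal sketch of how the kernel log could be scanned for killed processes. It assumes the kernel writes lines containing "Killed process <pid> (<name>)"; the exact wording varies between kernel versions, so treat the pattern as an assumption to adjust. The function name `find_oom_kills` is made up for this illustration and is not part of Flink or any library.

```python
import re
from typing import Iterable, List, Tuple

# Assumed message shape: "... Killed process <pid> (<name>) ...".
# Kernel versions differ in the surrounding text, so adjust if needed.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every line that reports a kill."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

It could be fed the saved output of dmesg or an open log file, for example `find_oom_kills(open("/var/log/messages", errors="replace"))`. A `java` entry whose timestamp matches the point where a TaskManager log stops would point to the kernel rather than to Flink itself.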