Thank you, Ovidiu.

On Wed, Jul 13, 2016 at 3:34 PM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:
> Hi,
>
> I would pay attention to the memory settings such that
> heap+off-heap+network buffers can be served from your node’s RAM for both
> TMs.
> Also, there is some correlation between the number of buffers, parallelism
> and your workflow’s operators. The suggested value for numberOfBuffers
> does not work in every case.
>
> I guess the numberOfBuffers could be automatically determined based on the
> parallelism and workflow’s operators, not sure how to do that.
>
> Best,
> Ovidiu
>
> On 12 Jul 2016, at 21:18, Saliya Ekanayake <esal...@gmail.com> wrote:
>
> Hi Ovidiu,
>
> Checking /var/log/messages based on Greg's response revealed the TMs were
> killed due to out of memory. Here's the node architecture. Each node has
> 128GB of RAM. I was trying to run 2 TMs per node, binding each to 12 cores
> (or 1 socket). The total number of nodes was 16. I finally managed to get
> it working with the following (non-default) settings.
>
> taskmanager.heap.mb: 12288
> taskmanager.numberOfTaskSlots: 12
> akka.ask.timeout: 1000 s
> taskmanager.network.numberOfBuffers: 36864
>
> Note that the number of buffers had to be higher (twice as high in this
> case) than what's suggested in Flink (#slots-per-TM^2 * #TMs * 4, which
> would be 12*12*32*4 = 18432). Otherwise, it would throw the "not enough
> buffers" error.
>
> Thank you,
> Saliya
>
> <juliet65.png>
>
> On Tue, Jul 12, 2016 at 7:39 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Hi,
>>
>> Can you post your configuration parameters (exclude default settings) and
>> cluster description?
>>
>> Best,
>> Ovidiu
>>
>> On 11 Jul 2016, at 17:49, Saliya Ekanayake <esal...@gmail.com> wrote:
>>
>> Thank you Greg, I'll check if this was the cause for my TMs to disappear.
>>
>> On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <c...@greghogan.com> wrote:
>>
>>> The OOM killer doesn't give warning so you'll need to call dmesg or look
>>> in /var/log/messages or similar.
>>> The following reports that Debian flavors may use /var/log/syslog.
>>>
>>> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>>>
>>> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>
>>>> Greg,
>>>>
>>>> where did you see the OOM log as shown in this mail thread? In my case
>>>> neither the TaskManagers nor the JobManager report an error like this.
>>>>
>>>> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>>
>>>>> These symptoms sound similar to what I was experiencing in the
>>>>> following thread. Flink can have some unexpected memory usage which can
>>>>> result in an OOM kill by the kernel, and this becomes more pronounced as
>>>>> the cluster size grows.
>>>>> https://www.mail-archive.com/dev@flink.apache.org/msg06346.html
>>>>>
>>>>> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>
>>>>>> I checked, but JVMs didn't crash. No puppet or other services like
>>>>>> that.
>>>>>>
>>>>>> One thing I found is that things work OK when I have a smaller number
>>>>>> of slaves. For example, here I was trying to run on 16 nodes giving 2 TMs
>>>>>> each. Then I reduced it to 4 nodes each with 2 TMs, which worked.
>>>>>>
>>>>>> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetz...@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> from the TaskManager logs, I cannot see anything suspicious.
>>>>>>> It's a bit weird that the TaskManager logs just end, without any
>>>>>>> shutdown messages. Usually the TMs log some shutdown messages when they
>>>>>>> are stopping.
>>>>>>> Also, if they were still running, I would expect some error
>>>>>>> messages from akka about the connection status.
>>>>>>> So the only thing I can conclude is that one of the TMs was killed by
>>>>>>> the OS or the JVM crashed. Did you check if that happened?
>>>>>>>
>>>>>>> Do you have any service like puppet that is controlling processes?
>>>>>>>
>>>>>>> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I see two logs (attached), but there's only 1 TaskManager process.
>>>>>>>> Also, the Web console says it can find only 1 TM.
>>>>>>>>
>>>>>>>> However, I see this part in the JM log, which shows there was a second
>>>>>>>> TM at one point, but it was unregistered. Any thoughts?
>>>>>>>>
>>>>>>>> --------------------------
>>>>>>>>
>>>>>>>> - Registered TaskManager at j-002
>>>>>>>> (akka.tcp://flink@172.16.0.2:42888/user/taskmanager) as
>>>>>>>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>>>>>>>> Current number of alive task slots is 12.
>>>>>>>>
>>>>>>>> 2016-07-07 11:32:40,363 WARN
>>>>>>>> akka.remote.ReliableDeliverySupervisor - Association with remote system
>>>>>>>> [akka.tcp://flink@172.16.0.2:42888] has failed, address is now
>>>>>>>> gated for [5000] ms. Reason is: [Disassociated].
>>>>>>>>
>>>>>>>> 2016-07-07 11:32:42,722 INFO
>>>>>>>> org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>>>>>>>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>>>>>>>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>>>>>>>> Current number of alive task slots is 24.
>>>>>>>>
>>>>>>>> 2016-07-07 11:33:15,316 WARN Remoting - Tried to associate with
>>>>>>>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888].
>>>>>>>> Address is now gated for 5000 ms, all messages to this address will be
>>>>>>>> delivered to dead letters. Reason: Connection refused: /172.16.0.2:42888
>>>>>>>>
>>>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>>> org.apache.flink.runtime.jobmanager.JobManager - Task manager
>>>>>>>> akka.tcp://flink@172.16.0.2:42888/user/taskmanager terminated.
>>>>>>>> 2016-07-07 11:33:15,320 INFO
>>>>>>>> org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>>>>>>>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number
>>>>>>>> of registered task managers 1. Number of available slots 12.
>>>>>>>>
>>>>>>>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> No, that should suffice. Can you check whether there are any task
>>>>>>>>> manager logs for the second TM on that machine
>>>>>>>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>>>>>>>> manager process does start up and there is another problem. If not,
>>>>>>>>> the task manager seems not to even start.
>>>>>>>>>
>>>>>>>>> – Ufuk
>>>>>>>>>
>>>>>>>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>>>> > I tried to run more than one task manager per node by duplicating
>>>>>>>>> > the slave IPs. At startup it says, for example,
>>>>>>>>> >
>>>>>>>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>>>>>>>> > Starting taskmanager daemon on host j-011.
>>>>>>>>> >
>>>>>>>>> > but I only see 1 task manager process running.
>>>>>>>>> >
>>>>>>>>> > Is there anything else I need to do?
>>>>>>>>> >
>>>>>>>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Yes, exactly.
>>>>>>>>> >>
>>>>>>>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>>>> >> > Thank you, yes, it can be done externally, if not supported within
>>>>>>>>> >> > Flink.
>>>>>>>>> >> >
>>>>>>>>> >> > So the way to spawn multiple task managers would be to list the same
>>>>>>>>> >> > slave machines N times as necessary in the slaves file?
>>>>>>>>> >> >
>>>>>>>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>>>> >> >>
>>>>>>>>> >> >> No, not inside of Flink. That sounds like something the OS or
>>>>>>>>> >> >> resource manager should handle.
>>>>>>>>> >> >>
>>>>>>>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>>>> >> >> > That's great, so is there support to pin task managers to sockets as
>>>>>>>>> >> >> > well?
>>>>>>>>> >> >> >
>>>>>>>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> Regarding 2) if you don't manually configure something else, that
>>>>>>>>> >> >> >> should always happen.
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> Yes, you can run more than one task manager per node depending on
>>>>>>>>> >> >> >> the process isolation you want. Within a task manager, there are
>>>>>>>>> >> >> >> multiple threads for each slot. For example, if you have 2 task
>>>>>>>>> >> >> >> managers with 2 slots each and submit a job with parallelism 4,
>>>>>>>>> >> >> >> each task manager will execute 2 subtasks in separate threads.
>>>>>>>>> >> >> >>
>>>>>>>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>>>> >> >> >> > Hi Ufuk,
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> > Looking at the document you sent, it seems only 1 task manager per
>>>>>>>>> >> >> >> > node exists and within that you have multiple slots. Is it possible
>>>>>>>>> >> >> >> > to run more than 1 task manager per node?
>>>>>>>>> >> >> >> > Also, within a task manager, is the parallelism done
>>>>>>>>> >> >> >> > through threads or processes?
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> > Thank you,
>>>>>>>>> >> >> >> > Saliya
>>>>>>>>> >> >> >> >
>>>>>>>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> Thank you, I'll check these.
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
>>>>>>>>> >> >> >> >> there a case where they wouldn't?
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>>>>>>> >> >> >> >>> > 1. What parameters are available to control parallelism within
>>>>>>>>> >> >> >> >>> > a node?
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> Task Manager processing slots:
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
>>>>>>>>> >> >> >> >>> > node (without doing TCP calls)?
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example if
>>>>>>>>> >> >> >> >>> you have a map-reduce, map subtask 1 and reduce subtask 1 are likely
>>>>>>>>> >> >> >> >>> to exchange data locally.
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> No, not that I'm aware of.
>>>>>>>>> >> >> >> >>>
>>>>>>>>> >> >> >> >>> – Ufuk
>>>>>>>>> >> >> >> >>
>>>>>>>>> >> >> >> >> --
>>>>>>>>> >> >> >> >> Saliya Ekanayake
>>>>>>>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>>>>>>>> >> >> >> >> School of Informatics and Computing | Digital Science Center
>>>>>>>>> >> >> >> >> Indiana University, Bloomington

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
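
The buffer arithmetic reported in this thread can be checked with a short Python sketch. It reproduces the rule of thumb quoted in the thread (#slots-per-TM^2 * #TMs * 4), the doubled value that ended up working, and a rough per-node memory estimate along the lines Ovidiu suggested. The function names are made up for illustration, and the 32 KB per network buffer is an assumption (taken here to be the default segment size; verify it for your Flink version). JVM overhead and any off-heap managed memory are not counted.

```python
def suggested_network_buffers(slots_per_tm: int, num_tms: int, factor: int = 4) -> int:
    """Rule of thumb quoted in the thread: slots-per-TM^2 * #TMs * 4."""
    return slots_per_tm ** 2 * num_tms * factor


def node_footprint_mb(heap_mb: int, num_buffers: int, tms_per_node: int,
                      buffer_kb: int = 32) -> float:
    """Rough heap + network-buffer memory per node, in MB.

    buffer_kb=32 is an assumption about the buffer (segment) size.
    """
    buffers_mb = num_buffers * buffer_kb / 1024
    return tms_per_node * (heap_mb + buffers_mb)


# Cluster described in the thread: 16 nodes, 2 TMs per node, 12 slots per TM.
nodes, tms_per_node, slots_per_tm = 16, 2, 12
num_tms = nodes * tms_per_node

suggested = suggested_network_buffers(slots_per_tm, num_tms)
used = 2 * suggested  # the thread reports that twice the suggestion was needed
footprint = node_footprint_mb(12288, used, tms_per_node)

print(f"suggested buffers: {suggested}")        # 18432
print(f"buffers used:      {used}")             # 36864
print(f"heap + buffers per node: {footprint} MB")  # 26880.0 MB
```

With the settings from the thread this gives about 26880 MB for heap plus network buffers across the two TMs on a node, so whatever else the TMs allocate has to fit in the remainder of the 128GB.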