Hi,

Can you post your configuration parameters (exclude default settings) and 
cluster description?

Best,
Ovidiu
> On 11 Jul 2016, at 17:49, Saliya Ekanayake <esal...@gmail.com> wrote:
> 
> Thank you Greg, I'll check if this was the cause for my TMs to disappear.
> 
> On Mon, Jul 11, 2016 at 11:34 AM, Greg Hogan <c...@greghogan.com 
> <mailto:c...@greghogan.com>> wrote:
> The OOM killer doesn't give warning so you'll need to call dmesg or look in 
> /var/log/messages or similar. The following reports that Debian flavors may 
> use /var/log/syslog.
>   
> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>  
> <http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer>
> 
> On Sun, Jul 10, 2016 at 11:55 PM, Saliya Ekanayake <esal...@gmail.com 
> <mailto:esal...@gmail.com>> wrote:
> Greg,
> 
> where did you see the OOM log as shown in this mail thread? In my case none 
> of the TaskManagers nor JobManger reports an error like this.
> 
> On Sun, Jul 10, 2016 at 8:45 PM, Greg Hogan <c...@greghogan.com 
> <mailto:c...@greghogan.com>> wrote:
> These symptoms sounds similar to what I was experiencing in the following 
> thread. Flink can have some unexpected memory usage which can result in an 
> OOM kill by the kernel, and this becomes more pronounced as the cluster size 
> grows.
>   https://www.mail-archive.com/dev@flink.apache.org/msg06346.html 
> <https://www.mail-archive.com/dev@flink.apache.org/msg06346.html>
> 
> On Fri, Jul 8, 2016 at 12:46 PM, Saliya Ekanayake <esal...@gmail.com 
> <mailto:esal...@gmail.com>> wrote:
> I checked, but JVMs didn't crash. No puppet or other services like that.
> 
> One thing I found is that things work OK when I have a smaller number of 
> slaves. For example, here I was trying to run on 16 nodes giving 2 TMs each. 
> Then I reduced it to 4 nodes each with 2 TMs, which worked.
> 
> 
> 
> On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetz...@apache.org 
> <mailto:rmetz...@apache.org>> wrote:
> Hi,
> from the TaskManager logs, I can not see anything suspicious.
> Its a bit weird that the TaskManager logs just end, without any shutdown 
> messages. Usually the TMs log some shut down stuff when they are stopping.
> Also, if they would be still running, I would expect some error messages from 
> akka about the connection status.
> So the only thing I conclude is that one of the TMs was killed by the OS or 
> the JVM crashed. Did you check if that happened?
> 
> Do you have any service like puppet that is controlling processes?
> 
> 
> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esal...@gmail.com 
> <mailto:esal...@gmail.com>> wrote:
> I see two logs (attached), but there's only 1 TaskManger process. Also, the 
> Web console says it can find only 1 TM. 
> 
> However, I see this part in JM log, which shows there was a second TM at one 
> point, but it was unregistered. Any thoughts?
> 
> --------------------------
> 
> - Registered TaskManager at j-002 
> (akka.tcp://flink@172.16.0.2:42888/user/taskmanager 
> <http://flink@172.16.0.2:42888/user/taskmanager>) as 
> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1. 
> Current number of alive task slots is 12.
> 
> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor - 
> Association with remote system [akka.tcp://flink@172.16.0.2:42888 
> <http://flink@172.16.0.2:42888/>] has failed, address is now gated for [5000] 
> ms. Reason is: [Disassociated].
> 
> 2016-07-07 11:32:42,722 INFO  
> org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 
> j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager 
> <http://flink@172.16.0.2:37373/user/taskmanager>) as 
> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2. 
> Current number of alive task slots is 24.
> 
> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with unreachable 
> remote address [akka.tcp://flink@172.16.0.2:42888 
> <http://flink@172.16.0.2:42888/>]. Address is now gated for 5000 ms, all 
> messages to this address will be delivered to dead letters. Reason: 
> Connection refused: /172.16.0.2:42888 <http://172.16.0.2:42888/>
> 
> 2016-07-07 11:33:15,320 INFO  org.apache.flink.runtime.jobmanager.JobManager 
> - Task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager 
> <http://flink@172.16.0.2:42888/user/taskmanager> terminated.
> 2016-07-07 11:33:15,320 INFO  
> org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager 
> akka.tcp://flink@172.16.0.2:42888/user/taskmanager 
> <http://flink@172.16.0.2:42888/user/taskmanager>. Number of registered task 
> managers 1. Number of available slots 12.
> 
> 
> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <u...@apache.org 
> <mailto:u...@apache.org>> wrote:
> No that should suffice. Can you check whether there are any task
> manager logs for the second TM on that machine
> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
> manager process does start up and there is another problem. If not,
> the task managers seems not to start even.
> 
> – Ufuk
> 
> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esal...@gmail.com 
> <mailto:esal...@gmail.com>> wrote:
> > I tried to run more than one task manager per node by duplicating the slave
> > IPs. At startup it says for example,
> >
> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
> > Starting taskmanager daemon on host j-011.
> >
> > but I only see 1 task manager process running.
> >
> > Is there anything else I need to do?
> >
> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <u...@apache.org 
> > <mailto:u...@apache.org>> wrote:
> >>
> >> Yes, exactly.
> >>
> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esal...@gmail.com 
> >> <mailto:esal...@gmail.com>>
> >> wrote:
> >> > Thank you, yes, it can be done externally, if not supported within
> >> > Flink.
> >> >
> >> > So the way to spawn multiple task managers would be to list the same
> >> > slave
> >> > machines N times as necessary in the slaves file?
> >> >
> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <u...@apache.org 
> >> > <mailto:u...@apache.org>> wrote:
> >> >>
> >> >> No, not inside of Flink. That sounds like something like the OS or
> >> >> resource manager should handle.
> >> >>
> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esal...@gmail.com 
> >> >> <mailto:esal...@gmail.com>>
> >> >> wrote:
> >> >> > That's great, so is there support to pin task managers to sockets as
> >> >> > well?
> >> >> >
> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <u...@apache.org 
> >> >> > <mailto:u...@apache.org>> wrote:
> >> >> >>
> >> >> >> Regarding 2) if you don't manually configure something else, that
> >> >> >> should happen always.
> >> >> >>
> >> >> >> Yes, you can run more than one task manager per node depending on
> >> >> >> the
> >> >> >> process isolation you want. Within a task manager, there are
> >> >> >> multiple
> >> >> >> threads for each slot. For example, if you have 2 task managers with
> >> >> >> 2
> >> >> >> slots each and submit a job with parallelism 4, each task manager
> >> >> >> will
> >> >> >> execute 2 sub tasks in separate Threads.
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esal...@gmail.com 
> >> >> >> <mailto:esal...@gmail.com>>
> >> >> >> wrote:
> >> >> >> > Hi Ufuk,
> >> >> >> >
> >> >> >> > Looking at the document you sent it seems only 1 task manager per
> >> >> >> > node
> >> >> >> > exist
> >> >> >> > and within that you have multiple slots. Is it possible to run
> >> >> >> > more
> >> >> >> > than
> >> >> >> > 1
> >> >> >> > task manager per node? Also, within a task manager is the
> >> >> >> > parallelism
> >> >> >> > done
> >> >> >> > through threads or processes?
> >> >> >> >
> >> >> >> > Thank you,
> >> >> >> > Saliya
> >> >> >> >
> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
> >> >> >> > <esal...@gmail.com <mailto:esal...@gmail.com>>
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Thank you, I'll check these.
> >> >> >> >>
> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
> >> >> >> >> there
> >> >> >> >> a
> >> >> >> >> case why they wouldn't?
> >> >> >> >>
> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <u...@apache.org 
> >> >> >> >> <mailto:u...@apache.org>>
> >> >> >> >> wrote:
> >> >> >> >>>
> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
> >> >> >> >>> <esal...@gmail.com <mailto:esal...@gmail.com>>
> >> >> >> >>> wrote:
> >> >> >> >>> > 1. What parameters are available to control parallelism within
> >> >> >> >>> > a
> >> >> >> >>> > node?
> >> >> >> >>>
> >> >> >> >>> Task Manager processing slots:
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> >> >> >> >>>  
> >> >> >> >>> <https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots>
> >> >> >> >>>
> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
> >> >> >> >>> > node
> >> >> >> >>> > (without
> >> >> >> >>> > doing TCP calls)?
> >> >> >> >>>
> >> >> >> >>> Yes, local exchanges happen via memory and not TCP, for example
> >> >> >> >>> if
> >> >> >> >>> you
> >> >> >> >>> have a map-reduce, map subtask 1 and reduce subtask 1 are likely
> >> >> >> >>> to
> >> >> >> >>> exchange data locally.
> >> >> >> >>>
> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
> >> >> >> >>>
> >> >> >> >>> No, not that I'm aware of.
> >> >> >> >>>
> >> >> >> >>> – Ufuk
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Saliya Ekanayake
> >> >> >> >> Ph.D. Candidate | Research Assistant
> >> >> >> >> School of Informatics and Computing | Digital Science Center
> >> >> >> >> Indiana University, Bloomington
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > Saliya Ekanayake
> >> >> >> > Ph.D. Candidate | Research Assistant
> >> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> >> > Indiana University, Bloomington
> >> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Saliya Ekanayake
> >> >> > Ph.D. Candidate | Research Assistant
> >> >> > School of Informatics and Computing | Digital Science Center
> >> >> > Indiana University, Bloomington
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Saliya Ekanayake
> >> > Ph.D. Candidate | Research Assistant
> >> > School of Informatics and Computing | Digital Science Center
> >> > Indiana University, Bloomington
> >> >
> >
> >
> >
> >
> > --
> > Saliya Ekanayake
> > Ph.D. Candidate | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> >
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 
> 
> 
> 
> 
> -- 
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> 

Reply via email to