I checked, and the JVMs didn't crash. There is no Puppet or similar service running either.
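
In case it helps, here is roughly how the JobManager log quoted below can be
checked for TaskManagers that registered and later dropped out. This is only a
sketch in Python: the function name and the two regular expressions are my own,
written against the two message formats visible in that log excerpt, and it
assumes unwrapped log lines (one event per line) rather than the mail-wrapped
form shown in this thread.

```python
import re

# Patterns written against the two JobManager messages quoted in this thread.
# They are assumptions about the log format, not an official Flink log schema.
REGISTERED = re.compile(r"Registered TaskManager at \S+ \((akka\.tcp://[^)]+)\)")
UNREGISTERED = re.compile(r"Unregistered task manager (akka\.tcp://\S+?/user/taskmanager)")


def alive_task_managers(log_lines):
    """Replay register/unregister events in order and return the surviving TM addresses."""
    alive = set()
    for line in log_lines:
        match = REGISTERED.search(line)
        if match:
            alive.add(match.group(1))
            continue
        match = UNREGISTERED.search(line)
        if match:
            alive.discard(match.group(1))
    return alive


if __name__ == "__main__":
    sample = [
        "- Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:42888/user/taskmanager) as 1c65415701f19978c8a8cdc75c993717.",
        "2016-07-07 11:32:42,722 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as 9c4ec66f5acbc19f7931fcae8345cd4e.",
        "2016-07-07 11:33:15,320 INFO  org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of registered task managers 1.",
    ]
    for address in sorted(alive_task_managers(sample)):
        print(address)
```

On the excerpt below, this leaves only the TaskManager on port 37373, which
matches the single TM that the web console reports.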

One thing I found is that things work OK with a smaller number of
slaves. For example, here I was trying to run on 16 nodes with 2 TMs
each. When I reduced it to 4 nodes, each with 2 TMs, it worked.
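
For reference, the slaves-file approach discussed further down in this thread
(listing each slave host N times to get N TaskManagers per node) can be
sketched like this. It is a minimal Python sketch: the function name and the
generated host names are only illustrative, and it just produces the lines to
put in the slaves file. It does not start anything.

```python
def slaves_file_lines(hosts, tms_per_host):
    """Return slaves-file lines with each host repeated tms_per_host times."""
    if tms_per_host < 1:
        raise ValueError("tms_per_host must be at least 1")
    lines = []
    for host in hosts:
        # One line per TaskManager that should be started on this host.
        lines.extend([host] * tms_per_host)
    return lines


if __name__ == "__main__":
    # Illustrative host names; replace with the real node names or IPs.
    hosts = ["j-%03d" % i for i in range(1, 5)]
    print("\n".join(slaves_file_lines(hosts, 2)))
```

With 4 hosts and 2 TMs each this yields 8 lines; with 16 hosts and 2 TMs each
it yields 32.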



On Fri, Jul 8, 2016 at 12:31 PM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
> from the TaskManager logs, I cannot see anything suspicious.
> It's a bit weird that the TaskManager logs just end, without any shutdown
> messages. Usually the TMs log some shutdown messages when they are stopping.
> Also, if they were still running, I would expect some error messages
> from akka about the connection status.
> So the only thing I can conclude is that one of the TMs was killed by the OS
> or the JVM crashed. Did you check if that happened?
>
> Do you have any service like puppet that is controlling processes?
>
>
> On Thu, Jul 7, 2016 at 5:46 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> I see two logs (attached), but there's only 1 TaskManager process. Also,
>> the Web console says it can find only 1 TM.
>>
>> However, I see this part in JM log, which shows there was a second TM at
>> one point, but it was unregistered. Any thoughts?
>>
>> --------------------------
>>
>> - Registered TaskManager at j-002
>> (akka.tcp://flink@172.16.0.2:42888/user/taskmanager) as
>> 1c65415701f19978c8a8cdc75c993717. Current number of registered hosts is 1.
>> Current number of alive task slots is 12.
>>
>> 2016-07-07 11:32:40,363 WARN  akka.remote.ReliableDeliverySupervisor -
>> Association with remote system [akka.tcp://flink@172.16.0.2:42888] has
>> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>
>> 2016-07-07 11:32:42,722 INFO
>>  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager
>> at j-002 (akka.tcp://flink@172.16.0.2:37373/user/taskmanager) as
>> 9c4ec66f5acbc19f7931fcae8345cd4e. Current number of registered hosts is 2.
>> Current number of alive task slots is 24.
>>
>> 2016-07-07 11:33:15,316 WARN  Remoting - Tried to associate with
>> unreachable remote address [akka.tcp://flink@172.16.0.2:42888]. Address
>> is now gated for 5000 ms, all messages to this address will be delivered to
>> dead letters. Reason: Connection refused: /172.16.0.2:42888
>>
>> 2016-07-07 11:33:15,320 INFO
>>  org.apache.flink.runtime.jobmanager.JobManager - Task manager
>> akka.tcp://flink@172.16.0.2:42888/user/taskmanager terminated.
>> 2016-07-07 11:33:15,320 INFO
>>  org.apache.flink.runtime.instance.InstanceManager - Unregistered task
>> manager akka.tcp://flink@172.16.0.2:42888/user/taskmanager. Number of
>> registered task managers 1. Number of available slots 12.
>>
>>
>> On Thu, Jul 7, 2016 at 4:27 AM, Ufuk Celebi <u...@apache.org> wrote:
>>
>>> No that should suffice. Can you check whether there are any task
>>> manager logs for the second TM on that machine
>>> (taskmanager-X-j-011.log where X is the TM number)? If yes, the task
>>> manager process does start up and there is another problem. If not,
>>> the task manager does not even seem to start.
>>>
>>> – Ufuk
>>>
>>> On Thu, Jul 7, 2016 at 7:34 AM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>> > I tried to run more than one task manager per node by duplicating the
>>> > slave IPs. At startup it says, for example,
>>> >
>>> > [INFO] 1 instance(s) of taskmanager are already running on j-011.
>>> > Starting taskmanager daemon on host j-011.
>>> >
>>> > but I only see 1 task manager process running.
>>> >
>>> > Is there anything else I need to do?
>>> >
>>> > On Sun, Jul 3, 2016 at 11:28 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >>
>>> >> Yes, exactly.
>>> >>
>>> >> On Sat, Jul 2, 2016 at 6:28 PM, Saliya Ekanayake <esal...@gmail.com>
>>> >> wrote:
>>> >> > Thank you, yes, it can be done externally, if not supported within
>>> >> > Flink.
>>> >> >
>>> >> > So the way to spawn multiple task managers would be to list the same
>>> >> > slave machines N times as necessary in the slaves file?
>>> >> >
>>> >> > On Sat, Jul 2, 2016 at 11:22 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >> >>
>>> >> >> No, not inside of Flink. That sounds like something the OS or
>>> >> >> resource manager should handle.
>>> >> >>
>>> >> >> On Sat, Jul 2, 2016 at 5:12 PM, Saliya Ekanayake <esal...@gmail.com>
>>> >> >> wrote:
>>> >> >> > That's great, so is there support to pin task managers to sockets
>>> >> >> > as well?
>>> >> >> >
>>> >> >> > On Sat, Jul 2, 2016 at 11:08 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >> >> >>
>>> >> >> >> Regarding 2) if you don't manually configure something else, that
>>> >> >> >> should always happen.
>>> >> >> >>
>>> >> >> >> Yes, you can run more than one task manager per node, depending on
>>> >> >> >> the process isolation you want. Within a task manager, there are
>>> >> >> >> multiple threads for each slot. For example, if you have 2 task
>>> >> >> >> managers with 2 slots each and submit a job with parallelism 4, each
>>> >> >> >> task manager will execute 2 subtasks in separate threads.
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Sat, Jul 2, 2016 at 3:26 AM, Saliya Ekanayake <esal...@gmail.com>
>>> >> >> >> wrote:
>>> >> >> >> > Hi Ufuk,
>>> >> >> >> >
>>> >> >> >> > Looking at the document you sent, it seems only 1 task manager per
>>> >> >> >> > node exists, and within that you have multiple slots. Is it possible
>>> >> >> >> > to run more than 1 task manager per node? Also, within a task
>>> >> >> >> > manager, is the parallelism done through threads or processes?
>>> >> >> >> >
>>> >> >> >> > Thank you,
>>> >> >> >> > Saliya
>>> >> >> >> >
>>> >> >> >> > On Thu, Jun 30, 2016 at 5:27 PM, Saliya Ekanayake
>>> >> >> >> > <esal...@gmail.com>
>>> >> >> >> > wrote:
>>> >> >> >> >>
>>> >> >> >> >> Thank you, I'll check these.
>>> >> >> >> >>
>>> >> >> >> >> In 2.) you said they are likely to exchange through memory. Is
>>> >> >> >> >> there a case where they wouldn't?
>>> >> >> >> >>
>>> >> >> >> >> On Thu, Jun 30, 2016 at 5:03 AM, Ufuk Celebi <u...@apache.org>
>>> >> >> >> >> wrote:
>>> >> >> >> >>>
>>> >> >> >> >>> On Thu, Jun 30, 2016 at 1:44 AM, Saliya Ekanayake
>>> >> >> >> >>> <esal...@gmail.com>
>>> >> >> >> >>> wrote:
>>> >> >> >> >>> > 1. What parameters are available to control parallelism within
>>> >> >> >> >>> > a node?
>>> >> >> >> >>>
>>> >> >> >> >>> Task Manager processing slots:
>>> >> >> >> >>>
>>> >> >> >> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>>> >> >> >> >>>
>>> >> >> >> >>> > 2. Does Flink support shared memory-based messaging within a
>>> >> >> >> >>> > node (without doing TCP calls)?
>>> >> >> >> >>>
>>> >> >> >> >>> Yes, local exchanges happen via memory and not TCP. For example,
>>> >> >> >> >>> if you have a map-reduce, map subtask 1 and reduce subtask 1 are
>>> >> >> >> >>> likely to exchange data locally.
>>> >> >> >> >>>
>>> >> >> >> >>> > 3. Is there support for Infiniband interconnect?
>>> >> >> >> >>>
>>> >> >> >> >>> No, not that I'm aware of.
>>> >> >> >> >>>
>>> >> >> >> >>> – Ufuk
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> --
>>> >> >> >> >> Saliya Ekanayake
>>> >> >> >> >> Ph.D. Candidate | Research Assistant
>>> >> >> >> >> School of Informatics and Computing | Digital Science Center
>>> >> >> >> >> Indiana University, Bloomington
>>> >> >> >> >>
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> > --
>>> >> >> >> > Saliya Ekanayake
>>> >> >> >> > Ph.D. Candidate | Research Assistant
>>> >> >> >> > School of Informatics and Computing | Digital Science Center
>>> >> >> >> > Indiana University, Bloomington
>>> >> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Saliya Ekanayake
>>> >> >> > Ph.D. Candidate | Research Assistant
>>> >> >> > School of Informatics and Computing | Digital Science Center
>>> >> >> > Indiana University, Bloomington
>>> >> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Saliya Ekanayake
>>> >> > Ph.D. Candidate | Research Assistant
>>> >> > School of Informatics and Computing | Digital Science Center
>>> >> > Indiana University, Bloomington
>>> >> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Saliya Ekanayake
>>> > Ph.D. Candidate | Research Assistant
>>> > School of Informatics and Computing | Digital Science Center
>>> > Indiana University, Bloomington
>>> >
>>>
>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>


-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
