Hi Guangya,

I am facing one more issue. If I try to schedule tasks from an external client system running the same CLI (mesos-execute), the tasks are not getting launched: they reach the master, which simply drops the requests. Below are the related master logs:

I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with checkpointing disabled and capabilities [ ]
E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.026298 21369 master.cpp:1119] Framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259 disconnected
I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259
I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259
E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259 0ns to failover
I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the framework has terminated or is inactive
I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259
I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259

Can you please tell me what the reason is? The client is in the same network, but it does not run any master or slave processes.

Thanks & Regards,
Pradeep
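One detail that stands out in the logs above: the framework registers from [email protected]:47259, a loopback address the master cannot connect back to, which would explain why it is disconnected immediately. If that is indeed the cause, the LIBPROCESS_IP fix Ondrej suggests for the slaves further down this thread should apply to the client as well. A minimal sketch, assuming the client's routable address is 192.168.0.120 (a placeholder) and the master is the 192.168.0.102:5050 used in the earlier commands:

# On the external client only (it runs no master or slave processes):
# advertise the client's real address instead of the loopback alias.
export LIBPROCESS_IP=192.168.0.120   # placeholder for the client's actual IP

# Re-run the same CLI that was being dropped before.
./src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" \
    --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"

If the master log afterwards still shows the framework at 127.0.1.1, the address is coming from somewhere else and is worth tracing further.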
On 5 October 2015 at 12:13, Guangya Liu <[email protected]> wrote:

> Hi Pradeep,
>
> Glad it finally works! Not sure if you are using systemd.slice or not; are
> you running into this issue:
> https://issues.apache.org/jira/browse/MESOS-1195
>
> Hope Jie Yu can give you some help on this ;-)
>
> Thanks,
>
> Guangya
>
> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <[email protected]> wrote:
>
>> Hi Guangya,
>>
>> Thanks for sharing the information.
>>
>> Now I can launch the tasks. The problem was with permissions: if I start
>> all the slaves and the master as root, it works fine; otherwise I have
>> problems launching the tasks.
>>
>> But on one of the slaves I could not launch the slave as root; I am facing
>> the following issue:
>>
>> Failed to create a containerizer: Could not create MesosContainerizer:
>> Failed to create launcher: Failed to create Linux launcher: Failed to mount
>> cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
>> attached to another hierarchy
>>
>> I took that slave out of the cluster for now. The tasks are getting
>> scheduled on the other two slave nodes.
>>
>> Thanks for your timely help.
>>
>> -Pradeep
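On the freezer error quoted above: it generally means something else on that node (systemd, LXC, Docker, or an earlier Mesos run) has already mounted the freezer cgroup subsystem under a different hierarchy, so the Linux launcher cannot mount it again at /sys/fs/cgroup/freezer. Below is only a diagnostic sketch; the slave flags mentioned in the comments are possibilities, not verified against your Mesos build:

# On the slave that reports "'freezer' is already attached to another hierarchy":
# list the existing cgroup mounts and see where freezer currently lives.
grep cgroup /proc/mounts
mount -t cgroup | grep freezer

# If freezer is already mounted elsewhere (e.g. by systemd), possible
# workarounds -- check that your build supports these flags -- are pointing
# --cgroups_hierarchy at the root under which freezer is already mounted, or
# starting this one slave with --launcher=posix and accepting that it loses
# cgroup isolation.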
>>
>> On 5 October 2015 at 10:54, Guangya Liu <[email protected]> wrote:
>>
>>> Hi Pradeep,
>>>
>>> My steps were pretty simple, just as in
>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>
>>> On the master node:
>>> root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>
>>> On the 3 slave nodes:
>>> root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>
>>> Then schedule a task from any of the nodes; here I was using slave node
>>> mesos007. You can see that the two tasks were launched on different hosts.
>>>
>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at [email protected]:5050
>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>> Received status update TASK_RUNNING for task cluster-test
>>> ^C
>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at [email protected]:5050
>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>> Received status update TASK_RUNNING for task cluster-test
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <[email protected]> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I just want to know how you launched the tasks.
>>>>
>>>> 1. What processes have you started on the master?
>>>> 2. What processes have you started on the slaves?
>>>>
>>>> I am missing something here; otherwise, all my slaves have enough memory
>>>> and CPUs to launch the tasks I mentioned. What I am missing is some
>>>> configuration step.
>>>>
>>>> Thanks & Regards,
>>>> Pradeep
>>>>
>>>> On 3 October 2015 at 13:14, Guangya Liu <[email protected]> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> I did some tests with your case and found that the task can run
>>>>> randomly on any of the three slave hosts; every run may give a
>>>>> different result. The logic is here:
>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>> The allocator randomly shuffles the slaves every time it allocates
>>>>> resources for offers.
>>>>>
>>>>> I see that each of your tasks needs the minimum resources
>>>>> "cpus(*):3;mem(*):2560"; can you check whether all of your slaves have
>>>>> enough resources? If you want your tasks to run on other slaves, those
>>>>> slaves need to have at least 3 CPUs and 2560 MB of memory free.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <[email protected]> wrote:
>>>>>
>>>>>> Hi Ondrej,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> I did solve that issue; yes, you are right, there was an issue with
>>>>>> the slave IP address setting.
>>>>>>
>>>>>> Now I am facing an issue with scheduling the tasks. When I try to
>>>>>> schedule a task using
>>>>>>
>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>
>>>>>> the tasks always get scheduled on the same node. The resources from
>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>
>>>>>> I just start the mesos slaves like below:
>>>>>>
>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>>>>>
>>>>>> If I submit the task using the above mesos-execute command from one of
>>>>>> the slaves, it runs on that system. But when I submit the task from a
>>>>>> different system, it uses just that system and queues the tasks instead
>>>>>> of running them on the other slaves. Sometimes I see the message
>>>>>> "Failed to getgid: unknown user".
>>>>>>
>>>>>> Do I need to start some process to push the tasks onto all the slaves
>>>>>> equally? Am I missing something here?
>>>>>>
>>>>>> Regards,
>>>>>> Pradeep
>>>>>>
>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> the problem is with the IP your slave advertises - Mesos by default
>>>>>>> resolves your hostname - there are several solutions (let's say your
>>>>>>> node IP is 192.168.56.128):
>>>>>>>
>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>> 2) set the mesos options - ip, hostname
>>>>>>>
>>>>>>> one way to do this is to create the files
>>>>>>>
>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>
>>>>>>> for more configuration options see
>>>>>>> http://mesos.apache.org/documentation/latest/configuration
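Since the slaves in this thread are started directly from the build tree with ./bin/mesos-slave.sh rather than from a package, the /etc/mesos-slave/ip and /etc/mesos-slave/hostname files above may not be read; passing the equivalent options on the command line is another way to apply Ondrej's second suggestion. A sketch, reusing the 192.168.0.x addresses that show up later in the logs purely as placeholders:

# On each slave, advertise that node's own routable address rather than
# whatever its hostname happens to resolve to.
./bin/mesos-slave.sh --master=192.168.0.102:5050 --ip=192.168.0.116 --hostname=192.168.0.116
# ...and on another node:
./bin/mesos-slave.sh --master=192.168.0.102:5050 --ip=192.168.0.178 --hostname=192.168.0.178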
>>>>>>>
>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <[email protected]>:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>> Thanks for the reply. I found one interesting log message:
>>>>>>>>
>>>>>>>> 7410 master.cpp:5977] Removed slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave registered at the same address
>>>>>>>>
>>>>>>>> Mostly because of this issue, the slave nodes keep getting registered
>>>>>>>> and de-registered, each one making room for the next node. I can even
>>>>>>>> see this in the UI: for some time one node is added, and after a while
>>>>>>>> it is replaced by a new slave node.
>>>>>>>>
>>>>>>>> The above log is followed by the log messages below:
>>>>>>>>
>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 104089ns
>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]
>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned notice for position 384
>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 95171ns
>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from leveldb took 20333ns
>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Pradeep
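The slave(1)@127.0.1.1:5051 address in the log above suggests each slave is advertising the loopback alias that Debian/Ubuntu installs add to /etc/hosts for the machine's own hostname. Because every slave then appears at the same address, the master keeps replacing one registration with the next, which matches the "a new slave registered at the same address" message. A quick check on each slave node, as a sketch:

# Show what this node's hostname resolves to; 127.0.1.1 here confirms the issue.
getent hosts "$(hostname)"
grep -n 127.0.1.1 /etc/hosts

# The fix is then either correcting the /etc/hosts entry to the node's real
# address, or overriding the advertised address with LIBPROCESS_IP / --ip as
# discussed above.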
>>>>>>>>
>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> Please check some of my questions inline.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Guangya
>>>>>>>>>
>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 master and
>>>>>>>>>> 3 slaves.
>>>>>>>>>>
>>>>>>>>>> One slave runs on the master node itself and the other slaves run
>>>>>>>>>> on different nodes. Here "node" means a physical box.
>>>>>>>>>>
>>>>>>>>>> I tried running tasks with a one-node cluster and tested the task
>>>>>>>>>> scheduling using mesos-execute; that works fine.
>>>>>>>>>>
>>>>>>>>>> When I configure the three-node cluster (1 master and 3 slaves) and
>>>>>>>>>> look at the resources on the master (in the GUI), only the master
>>>>>>>>>> node's resources are visible. The other nodes' resources are not
>>>>>>>>>> visible; sometimes they are visible but in a deactivated state.
>>>>>>>>>>
>>>>>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>>>>>> There should be some logs in either the master or the slave telling
>>>>>>>>> you what is wrong.
>>>>>>>>>
>>>>>>>>>> Please let me know what could be the reason. All the nodes are in
>>>>>>>>>> the same network.
>>>>>>>>>>
>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>
>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>
>>>>>>>>>> the tasks always get scheduled on the same node. The resources from
>>>>>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>
>>>>>>>>> Based on your previous description, there is only one node in your
>>>>>>>>> cluster right now; that's why the other nodes are not available. We
>>>>>>>>> need to identify what is wrong with the other nodes first.
>>>>>>>>>
>>>>>>>>>> Is it required to register the frameworks from every slave node on
>>>>>>>>>> the master?
>>>>>>>>>>
>>>>>>>>> It is not required.
>>>>>>>>>
>>>>>>>>>> I have configured this cluster using the GitHub code.
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Pradeep
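For the original question about only the master node's resources being visible: besides the web UI, the master's state endpoint lists every slave it currently knows about, which makes it easy to see whether the other slaves ever registered and what address they advertised. A sketch, assuming the master from the commands above at 192.168.0.102:5050:

# Dump the master's view of the cluster; the "slaves" array shows each
# registered slave's hostname, pid and resources. A pid of ...@127.0.1.1:5051
# would point to the hostname-resolution problem discussed earlier.
curl -s http://192.168.0.102:5050/state.json | python -m json.tool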

