Re: Running a task in Mesos cluster

Pradeep Kiruvale Wed, 07 Oct 2015 02:28:44 -0700

Hi Guangya,

I am running a frame work from some other physical node, which is part of
the same network. Still I am getting below messages and the framework not
getting registered.


Any idea what is the reason?

I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout,
removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon
Framework (C++)) at
[email protected]:54203
I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework
89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at
[email protected]:54203
I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework
89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
[email protected]:54203
I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [  ]
I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework
89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected


Regards,
Pradeep


On 5 October 2015 at 13:51, Guangya Liu <[email protected]> wrote:

> Hi Pradeep,
>
> I think that the problem might be caused by that you are running the lxc
> container on master node and not sure if there are any port conflict or
> what else wrong.
>
> For my case, I was running the client in a new node but not on master
> node, perhaps you can have a try to put your client on a new node but not
> on master node.
>
> Thanks,
>
> Guangya
>
>
> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
> [email protected]> wrote:
>
>> Hi Guangya,
>>
>> Hmm!...That is strange in my case!
>>
>> If I run from the mesos-execute on one of the slave/master node then the
>> tasks get their resources and they get scheduled well.
>> But if I start the mesos-execute on another node which is neither
>> slave/master then I have this issue.
>>
>> I am using an lxc container on master as a client to launch the tasks.
>> This is also in the same network as master/slaves.
>> And I just launch the task as you did. But the tasks are not getting
>> scheduled.
>>
>>
>> On master the logs are same as I sent you before
>>
>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>
>> On both of the slaves I can see the below logs
>>
>> I1005 13:23:32.547987  4831 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0060 by [email protected]:5050
>> W1005 13:23:32.548135  4831 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>> I1005 13:23:33.697707  4833 slave.cpp:3926] Current disk usage 3.60%. Max
>> allowed age: 6.047984349521910days
>> I1005 13:23:34.098599  4829 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0061 by [email protected]:5050
>> W1005 13:23:34.098740  4829 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>> I1005 13:23:35.274569  4831 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0062 by [email protected]:5050
>> W1005 13:23:35.274683  4831 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>> I1005 13:23:36.193964  4829 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0063 by [email protected]:5050
>> W1005 13:23:36.194090  4829 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>> I1005 13:24:01.914788  4827 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0064 by [email protected]:5050
>> W1005 13:24:01.914937  4827 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>> I1005 13:24:03.469974  4833 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0065 by [email protected]:5050
>> W1005 13:24:03.470118  4833 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>> I1005 13:24:04.642654  4826 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0066 by [email protected]:5050
>> W1005 13:24:04.642812  4826 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>
>>
>>
>> On 5 October 2015 at 13:09, Guangya Liu <[email protected]> wrote:
>>
>>> Hi Pradeep,
>>>
>>> From your log, seems that the master process is exiting and this caused
>>> the framework fail over to another mesos master. Can you please show more
>>> detail for your issue reproduced steps?
>>>
>>> I did some test by running mesos-execute on a client host which does not
>>> have any mesos service and the task can schedule well.
>>>
>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 10"
>>> --resources="cpus(*):1;mem(*):256"
>>> I1005 18:59:47.974123  1233 sched.cpp:164] Version: 0.26.0
>>> I1005 18:59:47.990890  1248 sched.cpp:262] New master detected at
>>> [email protected]:5050
>>> I1005 18:59:47.993074  1248 sched.cpp:272] No credentials provided.
>>> Attempting to register without authentication
>>> I1005 18:59:48.001194  1249 sched.cpp:641] Framework registered with
>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>> task cluster-test submitted to slave
>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>> Received status update TASK_RUNNING for task cluster-test
>>> Received status update TASK_FINISHED for task cluster-test
>>> I1005 18:59:58.431144  1249 sched.cpp:1771] Asked to stop the driver
>>> I1005 18:59:58.431591  1249 sched.cpp:1040] Stopping framework
>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>> root      1259  1159  0 19:06 pts/0    00:00:00 grep --color=auto mesos
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>>
>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>> [email protected]> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> I am facing one more issue. If I try to schedule the tasks from some
>>>> external client system running the same cli mesos-execute.
>>>> The tasks are not getting launched. The tasks reach the Master and it
>>>> just drops the requests, below are the logs related to that
>>>>
>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>>  with checkpointing disabled and capabilities [  ]
>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket
>>>> with fd 14: Transport endpoint is not connected
>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> [email protected]:47259
>>>> disconnected
>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> [email protected]:47259
>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> [email protected]:47259
>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket
>>>> with fd 14: Transport endpoint is not connected
>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> [email protected]:47259 0ns to
>>>> failover
>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources
>>>> offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the
>>>> framework has terminated or is inactive
>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8;
>>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> [email protected]:47259
>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> [email protected]:47259
>>>>
>>>>
>>>> Can you please tell me what is the reason? The client is in the same
>>>> network as well. But it does not run any master or slave processes.
>>>>
>>>> Thanks & Regards,
>>>> Pradeeep
>>>>
>>>> On 5 October 2015 at 12:13, Guangya Liu <[email protected]> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> Glad it finally works! Not sure if you are using systemd.slice or not,
>>>>> are you running to this issue:
>>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>>
>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>>
>>>>>> Thanks for sharing the information.
>>>>>>
>>>>>> Now I could launch the tasks. The problem was with the permission. If
>>>>>> I start all the slaves and Master as root it works fine.
>>>>>> Else I have problem with launching the tasks.
>>>>>>
>>>>>> But on one of the slave I could not launch the slave as root, I am
>>>>>> facing the following issue.
>>>>>>
>>>>>> Failed to create a containerizer: Could not create
>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>> launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer':
>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>
>>>>>> I took that out from the cluster for now. The tasks are getting
>>>>>> scheduled on the other two slave nodes.
>>>>>>
>>>>>> Thanks for your timely help
>>>>>>
>>>>>> -Pradeep
>>>>>>
>>>>>> On 5 October 2015 at 10:54, Guangya Liu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> My steps was pretty simple just as
>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>
>>>>>>> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>>  ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>
>>>>>>> Then schedule a task on any of the node, here I was using slave node
>>>>>>> mesos007, you can see that the two tasks was launched on different host.
>>>>>>>
>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>> I1005 16:49:11.013432  2971 sched.cpp:164] Version: 0.26.0
>>>>>>> I1005 16:49:11.027802  2992 sched.cpp:262] New master detected at
>>>>>>> [email protected]:5050
>>>>>>> I1005 16:49:11.029579  2992 sched.cpp:272] No credentials provided.
>>>>>>> Attempting to register without authentication
>>>>>>> I1005 16:49:11.038182  2985 sched.cpp:641] Framework registered with
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>> task cluster-test submitted to slave
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0  <<<<<<<<<<<<<<<<<<
>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>> ^C
>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>> I1005 16:50:18.346984  3036 sched.cpp:164] Version: 0.26.0
>>>>>>> I1005 16:50:18.366114  3055 sched.cpp:262] New master detected at
>>>>>>> [email protected]:5050
>>>>>>> I1005 16:50:18.368010  3055 sched.cpp:272] No credentials provided.
>>>>>>> Attempting to register without authentication
>>>>>>> I1005 16:50:18.376338  3056 sched.cpp:641] Framework registered with
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>> task cluster-test submitted to slave
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Guangya
>>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> I just want to know how did you launch the tasks.
>>>>>>>>
>>>>>>>> 1. What processes you have started on Master?
>>>>>>>> 2. What are the processes you have started on Slaves?
>>>>>>>>
>>>>>>>> I am missing something here, otherwise all my slave have enough
>>>>>>>> memory and cpus to launch the tasks I mentioned.
>>>>>>>> What I am missing is some configuration steps.
>>>>>>>>
>>>>>>>> Thanks & Regards,
>>>>>>>> Pradeep
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> I did some test with your case and found that the task can run
>>>>>>>>> randomly on the three slave hosts, every time may have different 
>>>>>>>>> result.
>>>>>>>>> The logic is here:
>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>> The allocator will help random shuffle the slaves every time when
>>>>>>>>> allocate resources for offers.
>>>>>>>>>
>>>>>>>>> I see that every of your task need the minimum resources as "
>>>>>>>>> resources="cpus(*):3;mem(*):2560", can you help check if all of
>>>>>>>>> your slaves have enough resources? If you want your task run on other
>>>>>>>>> slaves, then those slaves need to have at least 3 cpus and 2550M 
>>>>>>>>> memory.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply
>>>>>>>>>>
>>>>>>>>>> I did solve that issue, yes you are right there was an issue with
>>>>>>>>>> slave IP address setting.
>>>>>>>>>>
>>>>>>>>>> Now I am facing issue with the scheduling the tasks. When I try
>>>>>>>>>> to schedule a task using
>>>>>>>>>>
>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 
>>>>>>>>>> 10845760 -g
>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>
>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>
>>>>>>>>>>  I just start the mesos slaves like below
>>>>>>>>>>
>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>>  --hostname=slave1
>>>>>>>>>>
>>>>>>>>>> If I submit the task using the above (mesos-execute) command from
>>>>>>>>>> same as one of the slave it runs on that system.
>>>>>>>>>>
>>>>>>>>>> But when I submit the task from some different system. It uses
>>>>>>>>>> just that system and queues the tasks not runs on the other slaves.
>>>>>>>>>> Some times I see the message "Failed to getgid: unknown user"
>>>>>>>>>>
>>>>>>>>>> Do I need to start some process to push the task on all the
>>>>>>>>>> slaves equally? Am I missing something here?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>
>>>>>>>>>>> the problem is with IP your slave advertise - mesos by default
>>>>>>>>>>> resolves your hostname - there are several solutions  (let say your 
>>>>>>>>>>> node ip
>>>>>>>>>>> is 192.168.56.128)
>>>>>>>>>>>
>>>>>>>>>>> 1)  export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>> 2)  set mesos options - ip, hostname
>>>>>>>>>>>
>>>>>>>>>>> one way to do this is to create files
>>>>>>>>>>>
>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>
>>>>>>>>>>> for more configuration options see
>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>> [email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>>>>>
>>>>>>>>>>>>  7410 master.cpp:5977] Removed slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new 
>>>>>>>>>>>> slave
>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>
>>>>>>>>>>>> Mostly because of this issue, the systems/slave nodes are
>>>>>>>>>>>> getting registered and de-registered to make a room for the next 
>>>>>>>>>>>> node. I
>>>>>>>>>>>> can even see this on
>>>>>>>>>>>> the UI interface, for some time one node got added and after
>>>>>>>>>>>> some time that will be replaced with the new slave node.
>>>>>>>>>>>>
>>>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting action
>>>>>>>>>>>> (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>> I1002 10:01:12.753885  7416 replica.cpp:679] Persisted action
>>>>>>>>>>>> at 384
>>>>>>>>>>>> E1002 10:01:12.753891  7417 process.cpp:1912] Failed to
>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>> I1002 10:01:12.753988  7413 master.cpp:3930] Registered slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>> I1002 10:01:12.754065  7413 master.cpp:1080] Slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>> I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with 
>>>>>>>>>>>> cpus(*):8;
>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>>> I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>> E1002 10:01:12.754118  7417 process.cpp:1912] Failed to
>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>> I1002 10:01:12.754132  7413 master.cpp:2553] Deactivating slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>> I1002 10:01:12.754237  7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>> I1002 10:01:12.754240  7413 replica.cpp:658] Replica received
>>>>>>>>>>>> learned notice for position 384
>>>>>>>>>>>> I1002 10:01:12.754360  7413 leveldb.cpp:343] Persisting action
>>>>>>>>>>>> (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>> I1002 10:01:12.754395  7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>>>> I1002 10:01:12.754406  7413 replica.cpp:679] Persisted action
>>>>>>>>>>>> at 384
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One slave runs on the Master Node itself and Other slaves run
>>>>>>>>>>>>>> on different nodes. Here node means the physical boxes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried running the tasks by configuring one Node cluster.
>>>>>>>>>>>>>> Tested the task scheduling using mesos-execute, works fine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I configure three Node cluster (1master and 3 slaves)
>>>>>>>>>>>>>> and try to see the resources on the master (in GUI) only the 
>>>>>>>>>>>>>> Master node
>>>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>>>>  The other nodes resources are not visible. Some times
>>>>>>>>>>>>>> visible but in a de-actived state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>> mesos-master? There should be some logs in either master or slave 
>>>>>>>>>>>>> telling
>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>>> are in the same network. *
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 
>>>>>>>>>>>>>> 10845760 -g
>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>> resources from the other nodes are not getting used to schedule 
>>>>>>>>>>>>>> the tasks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on your previous question, there is only one node in
>>>>>>>>>>>>> your cluster, that's why other nodes are not available. We need 
>>>>>>>>>>>>> first
>>>>>>>>>>>>> identify what is wrong with other three nodes first.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I*s it required to register the frameworks from every slave
>>>>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>>>>
>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Running a task in Mesos cluster

Reply via email to