Hi Guangya,

I am facing one more issue. If I try to schedule tasks from an external client system running the same CLI (mesos-execute), the tasks are not getting launched: they reach the master, which simply drops the requests. Below are the related master logs:

I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with checkpointing disabled and capabilities [ ]
E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.026298 21369 master.cpp:1119] Framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259 disconnected
I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259
I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259
E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259 0ns to failover
I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the framework has terminated or is inactive
I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259
I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at [email protected]:47259

Can you please tell me what the reason is? The client is in the same network, but it does not run any master or slave processes.

Thanks & Regards,
Pradeep
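One detail that stands out in the logs above: the framework registers from [email protected]:47259, a loopback address the master cannot connect back to, which would explain why it is disconnected immediately. If that is indeed the cause, the LIBPROCESS_IP fix Ondrej suggests for the slaves further down this thread should apply to the client as well. A minimal sketch, assuming the client's routable address is 192.168.0.120 (a placeholder) and the master is the 192.168.0.102:5050 used in the earlier commands:

# On the external client only (it runs no master or slave processes):
# advertise the client's real address instead of the loopback alias.
export LIBPROCESS_IP=192.168.0.120   # placeholder for the client's actual IP

# Re-run the same CLI that was being dropped before.
./src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" \
    --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"

If the master log afterwards still shows the framework at 127.0.1.1, the address is coming from somewhere else and is worth tracing further.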
On 5 October 2015 at 12:13, Guangya Liu <[email protected]> wrote:

> Hi Pradeep,
>
> Glad it finally works! Not sure if you are using systemd.slice or not; are
> you running into this issue:
> https://issues.apache.org/jira/browse/MESOS-1195
>
> Hope Jie Yu can give you some help on this ;-)
>
> Thanks,
>
> Guangya
>
> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <[email protected]> wrote:
>
>> Hi Guangya,
>>
>> Thanks for sharing the information.
>>
>> Now I can launch the tasks. The problem was with permissions: if I start
>> all the slaves and the master as root, it works fine; otherwise I have
>> problems launching the tasks.
>>
>> But on one of the slaves I could not launch the slave as root; I am facing
>> the following issue:
>>
>> Failed to create a containerizer: Could not create MesosContainerizer:
>> Failed to create launcher: Failed to create Linux launcher: Failed to mount
>> cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
>> attached to another hierarchy
>>
>> I took that slave out of the cluster for now. The tasks are getting
>> scheduled on the other two slave nodes.
>>
>> Thanks for your timely help.
>>
>> -Pradeep
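On the freezer error quoted above: it generally means something else on that node (systemd, LXC, Docker, or an earlier Mesos run) has already mounted the freezer cgroup subsystem under a different hierarchy, so the Linux launcher cannot mount it again at /sys/fs/cgroup/freezer. Below is only a diagnostic sketch; the slave flags mentioned in the comments are possibilities, not verified against your Mesos build:

# On the slave that reports "'freezer' is already attached to another hierarchy":
# list the existing cgroup mounts and see where freezer currently lives.
grep cgroup /proc/mounts
mount -t cgroup | grep freezer

# If freezer is already mounted elsewhere (e.g. by systemd), possible
# workarounds -- check that your build supports these flags -- are pointing
# --cgroups_hierarchy at the root under which freezer is already mounted, or
# starting this one slave with --launcher=posix and accepting that it loses
# cgroup isolation.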
>>
>> On 5 October 2015 at 10:54, Guangya Liu <[email protected]> wrote:
>>
>>> Hi Pradeep,
>>>
>>> My steps were pretty simple, just as in
>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>
>>> On the master node:
>>> root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>
>>> On the 3 slave nodes:
>>> root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>
>>> Then schedule a task from any of the nodes; here I was using slave node
>>> mesos007. You can see that the two tasks were launched on different hosts.
>>>
>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at [email protected]:5050
>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>> Received status update TASK_RUNNING for task cluster-test
>>> ^C
>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at [email protected]:5050
>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>> Received status update TASK_RUNNING for task cluster-test
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <[email protected]> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I just want to know how you launched the tasks.
>>>>
>>>> 1. What processes have you started on the master?
>>>> 2. What processes have you started on the slaves?
>>>>
>>>> I am missing something here; otherwise, all my slaves have enough memory
>>>> and CPUs to launch the tasks I mentioned. What I am missing is some
>>>> configuration step.
>>>>
>>>> Thanks & Regards,
>>>> Pradeep
>>>>
>>>> On 3 October 2015 at 13:14, Guangya Liu <[email protected]> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> I did some tests with your case and found that the task can run
>>>>> randomly on any of the three slave hosts; every run may give a
>>>>> different result. The logic is here:
>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>> The allocator randomly shuffles the slaves every time it allocates
>>>>> resources for offers.
>>>>>
>>>>> I see that each of your tasks needs the minimum resources
>>>>> "cpus(*):3;mem(*):2560"; can you check whether all of your slaves have
>>>>> enough resources? If you want your tasks to run on other slaves, those
>>>>> slaves need to have at least 3 CPUs and 2560 MB of memory free.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <[email protected]> wrote:
>>>>>
>>>>>> Hi Ondrej,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> I did solve that issue; yes, you are right, there was an issue with
>>>>>> the slave IP address setting.
>>>>>>
>>>>>> Now I am facing an issue with scheduling the tasks. When I try to
>>>>>> schedule a task using
>>>>>>
>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>
>>>>>> the tasks always get scheduled on the same node. The resources from
>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>
>>>>>> I just start the mesos slaves like below:
>>>>>>
>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>>>>>
>>>>>> If I submit the task using the above mesos-execute command from one of
>>>>>> the slaves, it runs on that system. But when I submit the task from a
>>>>>> different system, it uses just that system and queues the tasks instead
>>>>>> of running them on the other slaves. Sometimes I see the message
>>>>>> "Failed to getgid: unknown user".
>>>>>>
>>>>>> Do I need to start some process to push the tasks onto all the slaves
>>>>>> equally? Am I missing something here?
>>>>>>
>>>>>> Regards,
>>>>>> Pradeep
>>>>>>
>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> the problem is with the IP your slave advertises - Mesos by default
>>>>>>> resolves your hostname - there are several solutions (let's say your
>>>>>>> node IP is 192.168.56.128):
>>>>>>>
>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>> 2) set the mesos options - ip, hostname
>>>>>>>
>>>>>>> one way to do this is to create the files
>>>>>>>
>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>
>>>>>>> for more configuration options see
>>>>>>> http://mesos.apache.org/documentation/latest/configuration
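Since the slaves in this thread are started directly from the build tree with ./bin/mesos-slave.sh rather than from a package, the /etc/mesos-slave/ip and /etc/mesos-slave/hostname files above may not be read; passing the equivalent options on the command line is another way to apply Ondrej's second suggestion. A sketch, reusing the 192.168.0.x addresses that show up later in the logs purely as placeholders:

# On each slave, advertise that node's own routable address rather than
# whatever its hostname happens to resolve to.
./bin/mesos-slave.sh --master=192.168.0.102:5050 --ip=192.168.0.116 --hostname=192.168.0.116
# ...and on another node:
./bin/mesos-slave.sh --master=192.168.0.102:5050 --ip=192.168.0.178 --hostname=192.168.0.178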
>>>>>>>
>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <[email protected]>:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>> Thanks for the reply. I found one interesting log message:
>>>>>>>>
>>>>>>>> 7410 master.cpp:5977] Removed slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave registered at the same address
>>>>>>>>
>>>>>>>> Mostly because of this issue, the slave nodes keep getting registered
>>>>>>>> and de-registered, each one making room for the next node. I can even
>>>>>>>> see this in the UI: for some time one node is added, and after a while
>>>>>>>> it is replaced by a new slave node.
>>>>>>>>
>>>>>>>> The above log is followed by the log messages below:
>>>>>>>>
>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 104089ns
>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]
>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned notice for position 384
>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 95171ns
>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from leveldb took 20333ns
>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Pradeep
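The slave(1)@127.0.1.1:5051 address in the log above suggests each slave is advertising the loopback alias that Debian/Ubuntu installs add to /etc/hosts for the machine's own hostname. Because every slave then appears at the same address, the master keeps replacing one registration with the next, which matches the "a new slave registered at the same address" message. A quick check on each slave node, as a sketch:

# Show what this node's hostname resolves to; 127.0.1.1 here confirms the issue.
getent hosts "$(hostname)"
grep -n 127.0.1.1 /etc/hosts

# The fix is then either correcting the /etc/hosts entry to the node's real
# address, or overriding the advertised address with LIBPROCESS_IP / --ip as
# discussed above.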
>>>>>>>>
>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> Please check some of my questions inline.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Guangya
>>>>>>>>>
>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 master and
>>>>>>>>>> 3 slaves.
>>>>>>>>>>
>>>>>>>>>> One slave runs on the master node itself and the other slaves run
>>>>>>>>>> on different nodes. Here "node" means a physical box.
>>>>>>>>>>
>>>>>>>>>> I tried running tasks with a one-node cluster and tested the task
>>>>>>>>>> scheduling using mesos-execute; that works fine.
>>>>>>>>>>
>>>>>>>>>> When I configure the three-node cluster (1 master and 3 slaves) and
>>>>>>>>>> look at the resources on the master (in the GUI), only the master
>>>>>>>>>> node's resources are visible. The other nodes' resources are not
>>>>>>>>>> visible; sometimes they are visible but in a deactivated state.
>>>>>>>>>>
>>>>>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>>>>>> There should be some logs in either the master or the slave telling
>>>>>>>>> you what is wrong.
>>>>>>>>>
>>>>>>>>>> Please let me know what could be the reason. All the nodes are in
>>>>>>>>>> the same network.
>>>>>>>>>>
>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>
>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>
>>>>>>>>>> the tasks always get scheduled on the same node. The resources from
>>>>>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>
>>>>>>>>> Based on your previous description, there is only one node in your
>>>>>>>>> cluster right now; that's why the other nodes are not available. We
>>>>>>>>> need to identify what is wrong with the other nodes first.
>>>>>>>>>
>>>>>>>>>> Is it required to register the frameworks from every slave node on
>>>>>>>>>> the master?
>>>>>>>>>>
>>>>>>>>> It is not required.
>>>>>>>>>
>>>>>>>>>> I have configured this cluster using the GitHub code.
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Pradeep
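For the original question about only the master node's resources being visible: besides the web UI, the master's state endpoint lists every slave it currently knows about, which makes it easy to see whether the other slaves ever registered and what address they advertised. A sketch, assuming the master from the commands above at 192.168.0.102:5050:

# Dump the master's view of the cluster; the "slaves" array shows each
# registered slave's hostname, pid and resources. A pid of ...@127.0.1.1:5051
# would point to the hostname-resolution problem discussed earlier.
curl -s http://192.168.0.102:5050/state.json | python -m json.tool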

