Hi Pradeep,

Sorry, I cannot get much info from this log message. I see that you are
using balloon_framework; can you try mesos-execute instead?
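
For example, something like this (a sketch; adjust the path, the master
address, and the resources to your setup):

./src/mesos-execute --master=192.168.0.102:5050 --name="test-task" --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"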

Can you please add the option GLOG_v=1 when starting the master, and append
the whole log from when the master started?
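
For example (assuming you start the master from the build directory, as in
the earlier steps in this thread; adjust --ip to your master's address):

GLOG_v=1 ./bin/mesos-master.sh --ip=192.168.0.102 --work_dir=/var/lib/mesos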

Thanks,

Guangya

On Wed, Oct 7, 2015 at 6:17 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com>
wrote:

> Below are the logs from the Master:
>
> -Pradeep
>
> I1007 12:16:28.257853  8005 leveldb.cpp:343] Persisting action (20 bytes)
> to leveldb took 119428ns
> I1007 12:16:28.257884  8005 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 18847ns
> I1007 12:16:28.257891  8005 replica.cpp:679] Persisted action at 1440
> I1007 12:16:28.257912  8005 replica.cpp:664] Replica learned TRUNCATE
> action at position 1440
> I1007 12:16:36.666616  8002 http.cpp:336] HTTP GET for /master/state.json
> from 192.168.0.102:40721 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
> I1007 12:16:39.126030  8001 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.126428  8001 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [  ]
> E1007 12:16:39.127459  8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:39.127535  8000 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
> I1007 12:16:39.127734  8001 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:39.127765  8001 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:39.127768  8007 process.cpp:1912] Failed to shutdown socket
> with fd 14: Transport endpoint is not connected
> I1007 12:16:39.127789  8001 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.127879  8006 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
> I1007 12:16:39.127913  8001 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:39.129273  8005 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.129312  8005 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.129858  8003 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
> I1007 12:16:40.676519  8000 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.676678  8000 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [  ]
> I1007 12:16:40.677178  8006 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
> E1007 12:16:40.677217  8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:40.677409  8000 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:40.677441  8000 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.677453  8000 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:40.677459  8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:40.677501  8000 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:40.677520  8005 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
> I1007 12:16:40.678864  8004 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.678906  8004 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.679147  8001 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
> I1007 12:16:41.853121  8002 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.853281  8002 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [  ]
> E1007 12:16:41.853806  8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:41.853833  8004 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
> I1007 12:16:41.854032  8002 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:41.854063  8002 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.854076  8002 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:41.854080  8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:41.854126  8005 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
> I1007 12:16:41.854121  8002 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:41.855482  8006 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.855515  8006 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.855692  8001 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
> I1007 12:16:42.772830  8000 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.772974  8000 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [  ]
> I1007 12:16:42.773470  8004 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
> E1007 12:16:42.773495  8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:42.773679  8000 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:42.773697  8000 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.773708  8000 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:42.773710  8007 process.cpp:1912] Failed to shutdown socket
> with fd 14: Transport endpoint is not connected
> I1007 12:16:42.773761  8000 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:42.773779  8001 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
> I1007 12:16:42.775089  8005 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.775126  8005 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.775324  8005 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
> I1007 12:16:47.665941  8001 http.cpp:336] HTTP GET for /master/state.json
> from 192.168.0.102:40722 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
>
>
> On 7 October 2015 at 12:12, Guangya Liu <gyliu...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> Can you please append more logs from your master node? I just want to see
>> what is wrong with your master and why the framework keeps failing over.
>>
>> Thanks,
>>
>> Guangya
>>
>> On Wed, Oct 7, 2015 at 5:27 PM, Pradeep Kiruvale <
>> pradeepkiruv...@gmail.com> wrote:
>>
>>> Hi Guangya,
>>>
>>> I am running a framework from another physical node, which is part of
>>> the same network. Still, I am getting the messages below and the framework
>>> is not getting registered.
>>>
>>> Any idea what the reason is?
>>>
>>> I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout,
>>> removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon
>>> Framework (C++)) at
>>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>>> I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework
>>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at
>>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>>> I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework
>>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
>>> E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket
>>> with fd 13: Transport endpoint is not connected
>>> I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
>>> framework 'Balloon Framework (C++)' at
>>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>>> I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework
>>> Balloon Framework (C++) with checkpointing disabled and capabilities [  ]
>>> I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework
>>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
>>> E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket
>>> with fd 13: Transport endpoint is not connected
>>>
>>>
>>> Regards,
>>> Pradeep
>>>
>>>
>>> On 5 October 2015 at 13:51, Guangya Liu <gyliu...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> I think the problem might be caused by running the lxc container on the
>>>> master node; I am not sure if there is a port conflict or something else
>>>> wrong.
>>>>
>>>> In my case, I was running the client on a new node, not on the master
>>>> node. Perhaps you can try putting your client on a separate node rather
>>>> than on the master.
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>>
>>>> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
>>>> pradeepkiruv...@gmail.com> wrote:
>>>>
>>>>> Hi Guangya,
>>>>>
>>>>> Hmm! That is strange in my case.
>>>>>
>>>>> If I run mesos-execute on one of the slave/master nodes, then the tasks
>>>>> get their resources and are scheduled well. But if I start mesos-execute
>>>>> on another node which is neither a slave nor the master, then I have this
>>>>> issue.
>>>>>
>>>>> I am using an lxc container on the master as a client to launch the
>>>>> tasks. It is in the same network as the master/slaves, and I launch the
>>>>> task just as you did. But the tasks are not getting scheduled.
>>>>>
>>>>>
>>>>> On the master the logs are the same as I sent you before:
>>>>>
>>>>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>>
>>>>> On both of the slaves I can see the logs below:
>>>>>
>>>>> I1005 13:23:32.547987  4831 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:32.548135  4831 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>>>>> I1005 13:23:33.697707  4833 slave.cpp:3926] Current disk usage 3.60%.
>>>>> Max allowed age: 6.047984349521910days
>>>>> I1005 13:23:34.098599  4829 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:34.098740  4829 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>>>>> I1005 13:23:35.274569  4831 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:35.274683  4831 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>>>>> I1005 13:23:36.193964  4829 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:36.194090  4829 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>>>>> I1005 13:24:01.914788  4827 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:24:01.914937  4827 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>>>>> I1005 13:24:03.469974  4833 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:24:03.470118  4833 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>>>>> I1005 13:24:04.642654  4826 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:24:04.642812  4826 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>>
>>>>>
>>>>>
>>>>> On 5 October 2015 at 13:09, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> From your log, it seems that the master process is exiting, and this
>>>>>> caused the framework to fail over to another Mesos master. Can you
>>>>>> please show the detailed steps to reproduce your issue?
>>>>>>
>>>>>> I did a test by running mesos-execute on a client host which does not
>>>>>> run any Mesos service, and the task was scheduled fine.
>>>>>>
>>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>> --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
>>>>>> I1005 18:59:47.974123  1233 sched.cpp:164] Version: 0.26.0
>>>>>> I1005 18:59:47.990890  1248 sched.cpp:262] New master detected at
>>>>>> master@192.168.0.107:5050
>>>>>> I1005 18:59:47.993074  1248 sched.cpp:272] No credentials provided.
>>>>>> Attempting to register without authentication
>>>>>> I1005 18:59:48.001194  1249 sched.cpp:641] Framework registered with
>>>>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>>> task cluster-test submitted to slave
>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>> Received status update TASK_FINISHED for task cluster-test
>>>>>> I1005 18:59:58.431144  1249 sched.cpp:1771] Asked to stop the driver
>>>>>> I1005 18:59:58.431591  1249 sched.cpp:1040] Stopping framework
>>>>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>>>>> root      1259  1159  0 19:06 pts/0    00:00:00 grep --color=auto
>>>>>> mesos
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guangya
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guangya,
>>>>>>>
>>>>>>> I am facing one more issue. If I try to schedule the tasks from an
>>>>>>> external client system running the same mesos-execute CLI, the tasks do
>>>>>>> not get launched. The tasks reach the Master and it just drops the
>>>>>>> requests; below are the related logs:
>>>>>>>
>>>>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>>>>>  with checkpointing disabled and capabilities [  ]
>>>>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown
>>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> disconnected
>>>>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown
>>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns
>>>>>>> to failover
>>>>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated
>>>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning
>>>>>>> resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> because the framework has terminated or is inactive
>>>>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered
>>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total:
>>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], 
>>>>>>> allocated:
>>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered
>>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total:
>>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], 
>>>>>>> allocated:
>>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 
>>>>>>> () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>>
>>>>>>>
>>>>>>> Can you please tell me what the reason is? The client is in the same
>>>>>>> network as well, but it does not run any master or slave processes.
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Pradeep
>>>>>>>
>>>>>>> On 5 October 2015 at 12:13, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> Glad it finally works! I am not sure whether you are using
>>>>>>>> systemd.slice or not; are you running into this issue:
>>>>>>>> https://issues.apache.org/jira/browse/MESOS-1195
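>>>>>>>>
>>>>>>>> If it is that freezer conflict, one way to check which hierarchy the
>>>>>>>> freezer cgroup is already mounted on (just a diagnostic suggestion) is:
>>>>>>>>
>>>>>>>> grep freezer /proc/mounts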
>>>>>>>>
>>>>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Guangya
>>>>>>>>
>>>>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guangya,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for sharing the information.
>>>>>>>>>
>>>>>>>>> Now I can launch the tasks. The problem was with permissions: if I
>>>>>>>>> start all the slaves and the Master as root, it works fine; otherwise
>>>>>>>>> I have problems launching the tasks.
>>>>>>>>>
>>>>>>>>> But on one of the slaves I could not launch the slave as root; I am
>>>>>>>>> facing the following issue:
>>>>>>>>>
>>>>>>>>> Failed to create a containerizer: Could not create
>>>>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>>>>> launcher: Failed to mount cgroups hierarchy at 
>>>>>>>>> '/sys/fs/cgroup/freezer':
>>>>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>>>>
>>>>>>>>> I took that node out of the cluster for now. The tasks are getting
>>>>>>>>> scheduled on the other two slave nodes.
>>>>>>>>>
>>>>>>>>> Thanks for your timely help
>>>>>>>>>
>>>>>>>>> -Pradeep
>>>>>>>>>
>>>>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gyliu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>
>>>>>>>>>> My steps were pretty simple, just as in
>>>>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>>>>
>>>>>>>>>> On the Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>>>>> On the 3 slave nodes: root@mesos007:~/src/mesos/m1/mesos/build#
>>>>>>>>>> GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>>>>
>>>>>>>>>> Then schedule a task on any of the nodes. Here I was using slave
>>>>>>>>>> node mesos007; you can see that the two tasks were launched on
>>>>>>>>>> different hosts.
>>>>>>>>>>
>>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>>> I1005 16:49:11.013432  2971 sched.cpp:164] Version: 0.26.0
>>>>>>>>>> I1005 16:49:11.027802  2992 sched.cpp:262] New master detected at
>>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>>> I1005 16:49:11.029579  2992 sched.cpp:272] No credentials
>>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>>> I1005 16:49:11.038182  2985 sched.cpp:641] Framework registered
>>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>>> Framework registered with
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0  <<<<<<<<<<<<<<<<<<
>>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>>> ^C
>>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>>> I1005 16:50:18.346984  3036 sched.cpp:164] Version: 0.26.0
>>>>>>>>>> I1005 16:50:18.366114  3055 sched.cpp:262] New master detected at
>>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>>> I1005 16:50:18.368010  3055 sched.cpp:272] No credentials
>>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>>> I1005 16:50:18.376338  3056 sched.cpp:641] Framework registered
>>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>>> Framework registered with
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Guangya
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>>
>>>>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>>>>
>>>>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>>>>
>>>>>>>>>>> I must be missing some configuration step here; otherwise, all my
>>>>>>>>>>> slaves have enough memory and cpus to launch the tasks I mentioned.
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Pradeep
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gyliu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>
>>>>>>>>>>>> I did some testing with your case and found that the task can run
>>>>>>>>>>>> on any of the three slave hosts; every run may give a different
>>>>>>>>>>>> result. The logic is here:
>>>>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>>>>> The allocator randomly shuffles the slaves every time it allocates
>>>>>>>>>>>> resources for offers.
>>>>>>>>>>>>
>>>>>>>>>>>> I see that each of your tasks needs the minimum resources
>>>>>>>>>>>> resources="cpus(*):3;mem(*):2560"; can you check whether all of
>>>>>>>>>>>> your slaves have enough resources? If you want your task to run on
>>>>>>>>>>>> other slaves, then those slaves need to have at least 3 cpus and
>>>>>>>>>>>> 2560M of memory.
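>>>>>>>>>>>>
>>>>>>>>>>>> One way to check (just a suggestion) is to look at each slave's
>>>>>>>>>>>> resources in the master's state endpoint, for example:
>>>>>>>>>>>>
>>>>>>>>>>>> curl http://<master-ip>:5050/master/state.json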
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your reply
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did solve that issue; yes, you are right, there was an issue
>>>>>>>>>>>>> with the slave IP address setting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I
>>>>>>>>>>>>> try to schedule a task using
>>>>>>>>>>>>>
>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 
>>>>>>>>>>>>> 10845760 -g
>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>
>>>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just start the Mesos slaves like below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>>>>>  --hostname=slave1
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I submit the task using the above mesos-execute command from
>>>>>>>>>>>>> one of the slaves, it runs on that system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But when I submit the task from a different system, it uses just
>>>>>>>>>>>>> that system and queues the tasks instead of running them on the
>>>>>>>>>>>>> other slaves.
>>>>>>>>>>>>> Sometimes I see the message "Failed to getgid: unknown user".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do I need to start some process to spread the tasks across all
>>>>>>>>>>>>> the slaves? Am I missing something here?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <
>>>>>>>>>>>>> ondrej.sm...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the problem is with the IP your slave advertises. Mesos by
>>>>>>>>>>>>>> default resolves your hostname. There are several solutions (let
>>>>>>>>>>>>>> us say your node IP is 192.168.56.128):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>>>>> 2) set the Mesos options ip and hostname
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> one way to do this is to create these files:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for more configuration options see
>>>>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
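>>>>>>>>>>>>>>
>>>>>>>>>>>>>> alternatively, a sketch of passing the same settings as flags
>>>>>>>>>>>>>> when starting the slave (adjust the addresses to your setup):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050 --ip=192.168.56.128 --hostname=abc.mesos.com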
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>>>>> pradeepkiruv...@gmail.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the reply. I found one interesting log message:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  7410 master.cpp:5977] Removed slave
>>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new 
>>>>>>>>>>>>>>> slave
>>>>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mostly because of this issue, the slave nodes are getting
>>>>>>>>>>>>>>> registered and de-registered to make room for the next node. I
>>>>>>>>>>>>>>> can even see this in the UI: for some time one node is added,
>>>>>>>>>>>>>>> and after a while it is replaced with the new slave node.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The above log is followed by the log messages below:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>>> action (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>>>>> I1002 10:01:12.753885  7416 replica.cpp:679] Persisted
>>>>>>>>>>>>>>> action at 384
>>>>>>>>>>>>>>> E1002 10:01:12.753891  7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>>>>> I1002 10:01:12.753988  7413 master.cpp:3930] Registered
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8;
>>>>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>>>>> I1002 10:01:12.754065  7413 master.cpp:1080] Slave
>>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>>>>> I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) 
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] 
>>>>>>>>>>>>>>> (allocated:
>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>> I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>>> E1002 10:01:12.754118  7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>>>>> I1002 10:01:12.754132  7413 master.cpp:2553] Deactivating
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>>> I1002 10:01:12.754237  7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>>>>> I1002 10:01:12.754240  7413 replica.cpp:658] Replica
>>>>>>>>>>>>>>> received learned notice for position 384
>>>>>>>>>>>>>>> I1002 10:01:12.754360  7413 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>>> action (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>>>>> I1002 10:01:12.754395  7413 leveldb.cpp:401] Deleting ~2
>>>>>>>>>>>>>>> keys from leveldb took 20333ns
>>>>>>>>>>>>>>> I1002 10:01:12.754406  7413 replica.cpp:679] Persisted
>>>>>>>>>>>>>>> action at 384
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gyliu...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One slave runs on the Master node itself and the other slaves
>>>>>>>>>>>>>>>>> run on different nodes. Here, node means a physical box.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I tried running tasks after configuring a one-node cluster. I
>>>>>>>>>>>>>>>>> tested the task scheduling using mesos-execute; it works fine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When I configure a three-node cluster (1 master and 3 slaves)
>>>>>>>>>>>>>>>>> and look at the resources on the master (in the GUI), only the
>>>>>>>>>>>>>>>>> Master node's resources are visible. The other nodes'
>>>>>>>>>>>>>>>>> resources are not visible; sometimes they are visible but in a
>>>>>>>>>>>>>>>>> deactivated state.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>>>>> mesos-master? There should be some logs in either the master or
>>>>>>>>>>>>>>>> the slave telling you what is wrong.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>>>>>> are in the same network.*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 
>>>>>>>>>>>>>>>>> -l 10845760 -g
>>>>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>>>>> resources from the other nodes are not getting used to 
>>>>>>>>>>>>>>>>> schedule the tasks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Based on your previous question, there is only one node in your
>>>>>>>>>>>>>>>> cluster; that is why the other nodes are not available. We need
>>>>>>>>>>>>>>>> to identify what is wrong with the other three nodes first.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Is it required to register the frameworks from every slave
>>>>>>>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *I have configured this cluster using the GitHub code.*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
