Below are the logs from the Master.

-Pradeep

I1007 12:16:28.257853  8005 leveldb.cpp:343] Persisting action (20 bytes) to
leveldb took 119428ns
I1007 12:16:28.257884  8005 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 18847ns
I1007 12:16:28.257891  8005 replica.cpp:679] Persisted action at 1440
I1007 12:16:28.257912  8005 replica.cpp:664] Replica learned TRUNCATE
action at position 1440
I1007 12:16:36.666616  8002 http.cpp:336] HTTP GET for /master/state.json
from 192.168.0.102:40721 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
I1007 12:16:39.126030  8001 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.126428  8001 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [  ]
E1007 12:16:39.127459  8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:39.127535  8000 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:39.127734  8001 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:39.127765  8001 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:39.127768  8007 process.cpp:1912] Failed to shutdown socket
with fd 14: Transport endpoint is not connected
I1007 12:16:39.127789  8001 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.127879  8006 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:39.127913  8001 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:39.129273  8005 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.129312  8005 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.129858  8003 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:40.676519  8000 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.676678  8000 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [  ]
I1007 12:16:40.677178  8006 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
E1007 12:16:40.677217  8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:40.677409  8000 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:40.677441  8000 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.677453  8000 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:40.677459  8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:40.677501  8000 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:40.677520  8005 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
I1007 12:16:40.678864  8004 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.678906  8004 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.679147  8001 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
I1007 12:16:41.853121  8002 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.853281  8002 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [  ]
E1007 12:16:41.853806  8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:41.853833  8004 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:41.854032  8002 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:41.854063  8002 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.854076  8002 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:41.854080  8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:41.854126  8005 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:41.854121  8002 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:41.855482  8006 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.855515  8006 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.855692  8001 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:42.772830  8000 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.772974  8000 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [  ]
I1007 12:16:42.773470  8004 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
E1007 12:16:42.773495  8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:42.773679  8000 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:42.773697  8000 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.773708  8000 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:42.773710  8007 process.cpp:1912] Failed to shutdown socket
with fd 14: Transport endpoint is not connected
I1007 12:16:42.773761  8000 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:42.773779  8001 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
I1007 12:16:42.775089  8005 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.775126  8005 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.775324  8005 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
I1007 12:16:47.665941  8001 http.cpp:336] HTTP GET for /master/state.json
from 192.168.0.102:40722 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
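
Note that in every SUBSCRIBE above the scheduler advertises itself at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843, a loopback
address the master cannot connect back to, which matches the repeated
"Failed to shutdown socket ... Transport endpoint is not connected" errors.
Assuming the client resolves its own hostname to 127.0.1.1 (the same issue
Ondrej describes further down this thread), a likely fix is to export the
client's reachable address before starting the framework. The IP below is a
placeholder and the invocation is only a sketch:

export LIBPROCESS_IP=<client-node-ip>   # an address the master can reach
./balloon-framework ...                 # then start the framework as usual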


On 7 October 2015 at 12:12, Guangya Liu <gyliu...@gmail.com> wrote:

> Hi Pradeep,
>
> Can you please append more logs from your master node? I just want to see
> what is wrong with your master and why the framework keeps failing over.
>
> Thanks,
>
> Guangya
>
> On Wed, Oct 7, 2015 at 5:27 PM, Pradeep Kiruvale <
> pradeepkiruv...@gmail.com> wrote:
>
>> Hi Guangya,
>>
>> I am running a framework from another physical node that is part of the
>> same network. I am still getting the messages below, and the framework is
>> not getting registered.
>>
>> Any idea what the reason might be?
>>
>> I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout,
>> removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon
>> Framework (C++)) at
>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework
>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at
>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework
>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
>> E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket
>> with fd 13: Transport endpoint is not connected
>> I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
>> framework 'Balloon Framework (C++)' at
>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework
>> Balloon Framework (C++) with checkpointing disabled and capabilities [  ]
>> I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework
>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
>> E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket
>> with fd 13: Transport endpoint is not connected
>>
>>
>> Regards,
>> Pradeep
>>
>>
>> On 5 October 2015 at 13:51, Guangya Liu <gyliu...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> I think the problem might be that you are running the lxc container on
>>> the master node; I am not sure whether there is a port conflict or
>>> something else wrong.
>>>
>>> In my case, I was running the client on a separate node, not on the
>>> master node. Perhaps you can try putting your client on a node that is
>>> not the master.
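>>>
>>> One quick way to test whether the master can reach back into the lxc
>>> container (just a sketch of the check) is to take the scheduler address
>>> from your master log and try connecting to it from the master host:
>>>
>>> # replace with the scheduler address from your log, e.g. 127.0.1.1:47259;
>>> # an advertised loopback address will not be reachable from the master
>>> nc -vz 127.0.1.1 47259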
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>>
>>> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
>>> pradeepkiruv...@gmail.com> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> Hmm, that is strange in my case!
>>>>
>>>> If I run mesos-execute on one of the slave/master nodes, the tasks get
>>>> their resources and are scheduled well. But if I start mesos-execute on
>>>> another node that is neither a slave nor the master, I have this issue.
>>>>
>>>> I am using an lxc container on the master as a client to launch the
>>>> tasks. It is in the same network as the master/slaves, and I launch the
>>>> tasks exactly as you did, but they are not getting scheduled.
>>>>
>>>>
>>>> On the master, the logs are the same as the ones I sent you before:
>>>>
>>>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>
>>>> On both of the slaves I can see the below logs
>>>>
>>>> I1005 13:23:32.547987  4831 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:32.548135  4831 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>>>> I1005 13:23:33.697707  4833 slave.cpp:3926] Current disk usage 3.60%.
>>>> Max allowed age: 6.047984349521910days
>>>> I1005 13:23:34.098599  4829 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:34.098740  4829 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>>>> I1005 13:23:35.274569  4831 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:35.274683  4831 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>>>> I1005 13:23:36.193964  4829 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:36.194090  4829 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>>>> I1005 13:24:01.914788  4827 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:24:01.914937  4827 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>>>> I1005 13:24:03.469974  4833 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:24:03.470118  4833 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>>>> I1005 13:24:04.642654  4826 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:24:04.642812  4826 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>
>>>>
>>>>
>>>> On 5 October 2015 at 13:09, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> From your log, it seems that the master process is exiting, and this
>>>>> caused the framework to fail over to another Mesos master. Can you
>>>>> please share more detailed steps to reproduce your issue?
>>>>>
>>>>> I did a test by running mesos-execute on a client host that does not
>>>>> run any Mesos service, and the task was scheduled fine.
>>>>>
>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>> --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
>>>>> I1005 18:59:47.974123  1233 sched.cpp:164] Version: 0.26.0
>>>>> I1005 18:59:47.990890  1248 sched.cpp:262] New master detected at
>>>>> master@192.168.0.107:5050
>>>>> I1005 18:59:47.993074  1248 sched.cpp:272] No credentials provided.
>>>>> Attempting to register without authentication
>>>>> I1005 18:59:48.001194  1249 sched.cpp:641] Framework registered with
>>>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>> task cluster-test submitted to slave
>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>> Received status update TASK_FINISHED for task cluster-test
>>>>> I1005 18:59:58.431144  1249 sched.cpp:1771] Asked to stop the driver
>>>>> I1005 18:59:58.431591  1249 sched.cpp:1040] Stopping framework
>>>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>>>> root      1259  1159  0 19:06 pts/0    00:00:00 grep --color=auto mesos
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>>
>>>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>> I am facing one more issue. If I try to schedule tasks from an
>>>>>> external client system using the same mesos-execute CLI, the tasks
>>>>>> are not launched. The requests reach the Master, which simply drops
>>>>>> them; below are the related logs:
>>>>>>
>>>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>>>>  with checkpointing disabled and capabilities [  ]
>>>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown
>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> disconnected
>>>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown
>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns
>>>>>> to failover
>>>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated
>>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning
>>>>>> resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> because the framework has terminated or is inactive
>>>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered
>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total:
>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], 
>>>>>> allocated:
>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered
>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total:
>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], 
>>>>>> allocated:
>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () 
>>>>>> at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>
>>>>>>
>>>>>> Can you please tell me what the reason is? The client is in the same
>>>>>> network as well, but it does not run any master or slave processes.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Pradeep
>>>>>>
>>>>>> On 5 October 2015 at 12:13, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> Glad it finally works! I am not sure whether you are using systemd
>>>>>>> slices, but are you running into this issue:
>>>>>>> https://issues.apache.org/jira/browse/MESOS-1195
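>>>>>>>
>>>>>>> Given the error ("'freezer' is already attached to another
>>>>>>> hierarchy"), the freezer subsystem is probably already co-mounted
>>>>>>> somewhere else. A quick check, just as a sketch (paths can differ per
>>>>>>> distro, and only unmount if nothing else is using the hierarchy):
>>>>>>>
>>>>>>> grep freezer /proc/mounts      # see where/how freezer is mounted
>>>>>>> umount /sys/fs/cgroup/freezer  # then let mesos-slave remount it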
>>>>>>>
>>>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Guangya
>>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for sharing the information.
>>>>>>>>
>>>>>>>> Now I can launch the tasks. The problem was with permissions: if I
>>>>>>>> start all the slaves and the Master as root, it works fine;
>>>>>>>> otherwise I have problems launching the tasks.
>>>>>>>>
>>>>>>>> But on one of the nodes I could not launch the slave as root; I am
>>>>>>>> facing the following issue:
>>>>>>>>
>>>>>>>> Failed to create a containerizer: Could not create
>>>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>>>> launcher: Failed to mount cgroups hierarchy at 
>>>>>>>> '/sys/fs/cgroup/freezer':
>>>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>>>
>>>>>>>> I took that node out of the cluster for now. The tasks are getting
>>>>>>>> scheduled on the other two slave nodes.
>>>>>>>>
>>>>>>>> Thanks for your timely help
>>>>>>>>
>>>>>>>> -Pradeep
>>>>>>>>
>>>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gyliu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> My steps were pretty simple, just as in
>>>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>>>
>>>>>>>>> On the Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>>>> On each of the 3 slave nodes: root@mesos007:~/src/mesos/m1/mesos/build#
>>>>>>>>> GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>>>
>>>>>>>>> Then schedule a task from any of the nodes. Here I was using slave
>>>>>>>>> node mesos007; you can see that the two tasks were launched on
>>>>>>>>> different hosts.
>>>>>>>>>
>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>> I1005 16:49:11.013432  2971 sched.cpp:164] Version: 0.26.0
>>>>>>>>> I1005 16:49:11.027802  2992 sched.cpp:262] New master detected at
>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>> I1005 16:49:11.029579  2992 sched.cpp:272] No credentials
>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>> I1005 16:49:11.038182  2985 sched.cpp:641] Framework registered
>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0  <<<<<<<<<<<<<<<<<<
>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>> ^C
>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>> I1005 16:50:18.346984  3036 sched.cpp:164] Version: 0.26.0
>>>>>>>>> I1005 16:50:18.366114  3055 sched.cpp:262] New master detected at
>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>> I1005 16:50:18.368010  3055 sched.cpp:272] No credentials
>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>> I1005 16:50:18.376338  3056 sched.cpp:641] Framework registered
>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Guangya
>>>>>>>>>
>>>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guangya,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>
>>>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>>>
>>>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>>>
>>>>>>>>>> I am missing something here; otherwise, all my slaves have enough
>>>>>>>>>> memory and cpus to launch the tasks I mentioned. What I am missing
>>>>>>>>>> is probably some configuration step.
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gyliu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>
>>>>>>>>>>> I did some tests with your case and found that the task can run
>>>>>>>>>>> on any of the three slave hosts; each run may give a different
>>>>>>>>>>> result. The logic is here:
>>>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>>>> The allocator randomly shuffles the slaves each time it allocates
>>>>>>>>>>> resources for offers.
>>>>>>>>>>>
>>>>>>>>>>> I see that each of your tasks needs at least
>>>>>>>>>>> resources="cpus(*):3;mem(*):2560". Can you check whether all of
>>>>>>>>>>> your slaves have enough resources? If you want your tasks to run
>>>>>>>>>>> on other slaves, those slaves need at least 3 cpus and 2560 MB of
>>>>>>>>>>> memory free.
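>>>>>>>>>>>
>>>>>>>>>>> One quick way to check, assuming your master is at
>>>>>>>>>>> 192.168.0.102:5050 as in your command, is to query the master's
>>>>>>>>>>> /master/state.json endpoint and look at each slave's advertised
>>>>>>>>>>> resources:
>>>>>>>>>>>
>>>>>>>>>>> # print the resources every registered slave advertises
>>>>>>>>>>> curl -s http://192.168.0.102:5050/master/state.json \
>>>>>>>>>>>   | python -m json.tool | grep -A 5 '"resources"'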
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your reply
>>>>>>>>>>>>
>>>>>>>>>>>> I did solve that issue; you are right, there was a problem with
>>>>>>>>>>>> the slave IP address setting.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I try
>>>>>>>>>>>> to schedule a task using
>>>>>>>>>>>>
>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 
>>>>>>>>>>>> 10845760 -g
>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>
>>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>
>>>>>>>>>>>> I just start the Mesos slaves like this:
>>>>>>>>>>>>
>>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>>>>  --hostname=slave1
>>>>>>>>>>>>
>>>>>>>>>>>> If I submit the task using the above mesos-execute command from
>>>>>>>>>>>> one of the slave nodes, it runs on that system.
>>>>>>>>>>>>
>>>>>>>>>>>> But when I submit the task from a different system, it uses
>>>>>>>>>>>> just that system and queues the tasks instead of running them on
>>>>>>>>>>>> the other slaves. Sometimes I see the message "Failed to getgid:
>>>>>>>>>>>> unknown user".
>>>>>>>>>>>>
>>>>>>>>>>>> Do I need to start some process to spread the tasks across all
>>>>>>>>>>>> the slaves equally? Am I missing something here?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <
>>>>>>>>>>>> ondrej.sm...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>
>>>>>>>>>>>>> the problem is with the IP your slave advertises; by default
>>>>>>>>>>>>> mesos resolves your hostname. There are several solutions (say
>>>>>>>>>>>>> your node IP is 192.168.56.128):
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>>>> 2) set the mesos options --ip and --hostname
>>>>>>>>>>>>>
>>>>>>>>>>>>> one way to do this is to create these files:
>>>>>>>>>>>>>
>>>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>>>
>>>>>>>>>>>>> for more configuration options see
>>>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
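>>>>>>>>>>>>>
>>>>>>>>>>>>> to confirm the slave is currently advertising a loopback
>>>>>>>>>>>>> address, you can check what its hostname resolves to (on
>>>>>>>>>>>>> Debian/Ubuntu it is often 127.0.1.1 via /etc/hosts):
>>>>>>>>>>>>>
>>>>>>>>>>>>> getent hosts $(hostname)   # 127.0.1.1 here means peers cannot connect back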
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>>>> pradeepkiruv...@gmail.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the reply. I found one interesting log message:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  7410 master.cpp:5977] Removed slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new 
>>>>>>>>>>>>>> slave
>>>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mostly because of this issue, the slave nodes are getting
>>>>>>>>>>>>>> registered and de-registered to make room for the next node. I
>>>>>>>>>>>>>> can even see this in the UI: for some time one node is added,
>>>>>>>>>>>>>> and after a while it is replaced by the new slave node.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The above log line is followed by the log messages below:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>> action (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>>>> I1002 10:01:12.753885  7416 replica.cpp:679] Persisted action
>>>>>>>>>>>>>> at 384
>>>>>>>>>>>>>> E1002 10:01:12.753891  7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>>>> I1002 10:01:12.753988  7413 master.cpp:3930] Registered slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>>>> I1002 10:01:12.754065  7413 master.cpp:1080] Slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>>>> I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with 
>>>>>>>>>>>>>> cpus(*):8;
>>>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: 
>>>>>>>>>>>>>> )
>>>>>>>>>>>>>> I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>> E1002 10:01:12.754118  7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>>>> I1002 10:01:12.754132  7413 master.cpp:2553] Deactivating
>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>> I1002 10:01:12.754237  7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>>>> I1002 10:01:12.754240  7413 replica.cpp:658] Replica received
>>>>>>>>>>>>>> learned notice for position 384
>>>>>>>>>>>>>> I1002 10:01:12.754360  7413 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>> action (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>>>> I1002 10:01:12.754395  7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>>>>>> I1002 10:01:12.754406  7413 replica.cpp:679] Persisted action
>>>>>>>>>>>>>> at 384
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gyliu...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>>>> pradeepkiruv...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One slave runs on the Master node itself, and the other
>>>>>>>>>>>>>>>> slaves run on different nodes. Here "node" means a physical
>>>>>>>>>>>>>>>> box.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried running tasks on a one-node cluster and tested the
>>>>>>>>>>>>>>>> task scheduling using mesos-execute; that works fine.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I configure the three-node cluster (1 master and 3
>>>>>>>>>>>>>>>> slaves) and look at the resources on the master (in the
>>>>>>>>>>>>>>>> GUI), only the Master node's resources are visible. The
>>>>>>>>>>>>>>>> other nodes' resources are not visible, or are sometimes
>>>>>>>>>>>>>>>> visible but in a deactivated state.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>>>> mesos-master? There should be some logs in either master or 
>>>>>>>>>>>>>>> slave telling
>>>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please let me know what the reason could be. All the nodes
>>>>>>>>>>>>>>>> are in the same network.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 
>>>>>>>>>>>>>>>> 10845760 -g
>>>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>>>> resources from the other nodes are not getting used to 
>>>>>>>>>>>>>>>> schedule the tasks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Based on your previous question, there is effectively only
>>>>>>>>>>>>>>> one node in your cluster; that is why the other nodes are not
>>>>>>>>>>>>>>> available. We first need to identify what is wrong with the
>>>>>>>>>>>>>>> other nodes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it required to register the frameworks from every slave
>>>>>>>>>>>>>>>> node on the Master?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have configured this cluster using the code from GitHub.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
