Re: Welcome Kevin Klues as a Mesos Committer and PMC member!

2017-03-02 Thread Jay Guo
Congrats Kevin! Well deserved!

/J

On Thu, Mar 2, 2017 at 6:05 AM, Benjamin Mahler  wrote:
> Hi all,
>
> Please welcome Kevin Klues as the newest committer and PMC member of the
> Apache Mesos project.
>
> Kevin has been an active contributor to the project for over a year, and in
> this time he has made a number of contributions: Nvidia GPU
> support [1], the containerization side of POD support (new container init
> process), and support for "attach" and "exec" of commands within running
> containers [2].
>
> Also, Kevin took on an effort with Haris Choudhary to revive the CLI [3]
> via a better structured python implementation (to be more accessible to
> contributors) and a more extensible architecture to better support adding
> new or custom subcommands. The work also adds a unit test framework for the
> CLI functionality (we had no tests previously!). I think it's great that
> Kevin took on this much needed improvement with Haris, and I'm very much
> looking forward to seeing this land in the project.
>
> Here is his committer eligibility document for perusal:
> https://docs.google.com/document/d/1mlO1yyLCoCSd85XeDKIxTYyboK_uiOJ4Uwr6ruKTlFM/edit
>
> Thanks!
> Ben
>
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2]
> https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU
> [3]
> https://docs.google.com/document/d/1r6Iv4Efu8v8IBrcUTjgYkvZ32WVscgYqrD07OyIglsA/


Re: Mesos does not assign all available resources

2017-03-02 Thread Benjamin Mahler
What Gabriel is alluding to is a situation where you have:

* Frameworks with lower shares that do not want additional resources, and
* Frameworks with higher shares that want additional resources.

If there are a sufficient number of frameworks, it's possible for the
decline filters of the low share frameworks to expire before we get a
chance to offer resources to the high share frameworks. In this case, we
are stuck offering to the low share frameworks and never get a chance to
offer to the high share frameworks.

I can't tell yet if this is what is occurring in your setup, but the
recommendation is to update the scheduler to make a SUPPRESS call to tell
Mesos that it does not want any more resources (and to REVIVE later when it
does). In your case that means that once the task list is emptied, you
should send a SUPPRESS call.
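
A rough sketch of that pattern, based on the test_framework.py structure
mentioned elsewhere in this thread. It assumes the Python bindings in use
expose suppressOffers() and reviveOffers() on the driver (mirroring the C++
SchedulerDriver; if your bindings lack them, the SUPPRESS and REVIVE calls of
the v1 HTTP scheduler API are the equivalent). Class and method names are
illustrative and task construction is elided:

from mesos.interface import Scheduler

class QueueDrivenScheduler(Scheduler):
    def __init__(self):
        self.pending_tasks = []   # TaskInfos waiting to be launched
        self.suppressed = False

    def resourceOffers(self, driver, offers):
        for offer in offers:
            if not self.pending_tasks:
                # Nothing left to run: give the offer back right away.
                driver.declineOffer(offer.id)
                continue
            # Build the TaskInfo as in test_framework.py (elided here).
            task = self.pending_tasks.pop(0)
            driver.launchTasks(offer.id, [task])

        if not self.pending_tasks and not self.suppressed:
            # Task list is empty: stop receiving offers entirely instead of
            # relying on short decline filters that eventually expire.
            driver.suppressOffers()
            self.suppressed = True

    def add_work(self, driver, tasks):
        # New work arrived: ask for offers again.
        self.pending_tasks.extend(tasks)
        if self.suppressed:
            driver.reviveOffers()
            self.suppressed = False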

Ben



On Thu, Mar 2, 2017 at 4:33 PM, Gabriel Hartmann 
wrote:

> Possibly the suppress/revive problem.
>
> On Thu, Mar 2, 2017 at 4:30 PM Benjamin Mahler  wrote:
>
>> Can you upload the full logs somewhere and link to them here?
>>
>> How many frameworks are you running? Do they all run in the "*" role?
>> Are the tasks short lived or long lived?
>> Can you update your test to not use the --offer_timeout? The intention of
>> that is to mitigate against frameworks that hold on to offers, but it
>> sounds like your frameworks decline.
>>
>> On Thu, Mar 2, 2017 at 3:57 PM, Harold Molina-Bulla wrote:
>>
>> Hi,
>>
>> Thanks for your reply.
>>
>> Hi there, more clarification is needed:
>>
>> I have close to 800 CPUs, but the system does not assign all the
>> available resources to all our tasks.
>>
>> What do you mean precisely here? Can you describe what you're seeing?
>> Also, you have more than 800GB of RAM, right?
>>
>>
>> Yes, we have at least 2GBytes per CPU, and typically our resource table
>> looks like:
>>
>> In this case 346/788 CPUs are available and not assigned to any task, but
>> we have more than 400 tasks waiting to run.
>>
>> Checking the mesos-master log, it does not make offers to all running
>> frameworks all the time, just to a few of them:
>>
>> I0303 00:16:01.964318 31791 master.cpp:6517] Sending 3 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0053 (Ejecucion: FRUS) at
>> scheduler-52a267e9-30d1-4cc8-847e-fa7acfddf855@192.168.151.147:32899
>> I0303 00:16:01.966234 31791 master.cpp:6517] Sending 5 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0072 (:izanami) at
>> scheduler-ce746b8b-adac-4a0c-8310-5d312c9ed04f@192.168.151.186:44233
>> I0303 00:16:01.968003 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0084 (vatmoutput) at
>> scheduler-078b1978-840a-437e-a23e-5bca8c5e05c8@192.168.151.84:43023
>> I0303 00:16:01.969828 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0081 (vatmoutput) at
>> scheduler-d921e4bb-ee23-4e77-93d9-7742264839e5@192.168.151.84:43067
>> I0303 00:16:01.971613 31791 master.cpp:6517] Sending 6 offers to
>> framework c5299003-e29d-43cb-8ca7-887ab24c8513-0175 (:izanami) at
>> scheduler-e10a1167-62d7-4ded-b932-792b5478ab61@192.168.151.186:38706
>> I0303 00:16:01.973351 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0082 (vatmoutputg) at
>> scheduler-c4db35be-41e1-45cb-8005-f0f7827a23d0@192.168.151.84:33668
>> I0303 00:16:01.975126 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0062 (vatmvalidation) at
>> scheduler-44ed1457-a752-4037-89b6-590221db3de5@192.168.151.84:33148
>> I0303 00:16:01.976877 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0077 (:izanami) at
>> scheduler-c648708f-32f3-44d5-9014-3fd0dbb461f7@192.168.151.186:35345
>> I0303 00:16:01.978590 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0083 (vatmoutputg) at
>> scheduler-fb965e89-5764-4a07-a94a-43de45babc7a@192.168.151.84:39218
>>
>> We have close to twice this many frameworks running at the moment, one of
>> them (not included above) with more than 300 tasks waiting and just 100 CPUs
>> assigned (1 CPU per task).
>>
>> The problem, we think, is that the mesos-master does not offer resources to
>> all the tasks all the time, and the declined resources are not re-offered to
>> other tasks. Any idea how to change this behavior, or the rate at which
>> resources are offered to the tasks?
>>
>> FYI We set the --offer_timeout=1sec
>>
>> Thanks in advance.
>>
>> Harold Molina-Bulla Ph.D.
>> On 02/03/2017 23:28, Benjamin Mahler wrote:
>>
>>
>> Ben
>>
>> On Thu, Mar 2, 2017 at 9:00 AM, Harold Molina-Bulla wrote:
>>
>> Hi Everybody,
>>
>> We are trying to develop a scheduler in Python to distribute processes
>> in a Mesos cluster.
>>
>> I have close to 800 CPUs, but the system does not assign all the
>> available resources to all our tasks.

Re: Mesos does not assign all available resources

2017-03-02 Thread Benjamin Mahler
Also, what is the allocation that each framework has when you reach your
steady state?
Are there frameworks that don't have any more work to do but have a really
low share of the cluster?

On Thu, Mar 2, 2017 at 4:29 PM, Benjamin Mahler  wrote:

> Can you upload the full logs somewhere and link to them here?
>
> How many frameworks are you running? Do they all run in the "*" role?
> Are the tasks short lived or long lived?
> Can you update your test to not use the --offer_timeout? The intention of
> that is to mitigate against frameworks that hold on to offers, but it
> sounds like your frameworks decline.
>
> On Thu, Mar 2, 2017 at 3:57 PM, Harold Molina-Bulla 
> wrote:
>
>> Hi,
>>
>> Thanks for your reply.
>>
>> Hi there, more clarification is needed:
>>
>>> I have close to 800 CPUs, but the system does not assign all the
>>> available resources to all our tasks.
>>>
>> What do you mean precisely here? Can you describe what you're seeing?
>> Also, you have more than 800GB of RAM, right?
>>
>>
>> Yes, we have at least 2GBytes per CPU, and typically our resource table
>> looks like:
>>
>> In this case 346/788 CPUs are available and not assigned to any task, but
>> we have more than 400 tasks waiting to run.
>>
>> Checking the mesos-master log, it does not make offers to all running
>> frameworks all the time, just to a few of them:
>>
>> I0303 00:16:01.964318 31791 master.cpp:6517] Sending 3 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0053 (Ejecucion: FRUS) at
>> scheduler-52a267e9-30d1-4cc8-847e-fa7acfddf855@192.168.151.147:32899
>> I0303 00:16:01.966234 31791 master.cpp:6517] Sending 5 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0072 (:izanami) at
>> scheduler-ce746b8b-adac-4a0c-8310-5d312c9ed04f@192.168.151.186:44233
>> I0303 00:16:01.968003 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0084 (vatmoutput) at
>> scheduler-078b1978-840a-437e-a23e-5bca8c5e05c8@192.168.151.84:43023
>> I0303 00:16:01.969828 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0081 (vatmoutput) at
>> scheduler-d921e4bb-ee23-4e77-93d9-7742264839e5@192.168.151.84:43067
>> I0303 00:16:01.971613 31791 master.cpp:6517] Sending 6 offers to
>> framework c5299003-e29d-43cb-8ca7-887ab24c8513-0175 (:izanami) at
>> scheduler-e10a1167-62d7-4ded-b932-792b5478ab61@192.168.151.186:38706
>> I0303 00:16:01.973351 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0082 (vatmoutputg) at
>> scheduler-c4db35be-41e1-45cb-8005-f0f7827a23d0@192.168.151.84:33668
>> I0303 00:16:01.975126 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0062 (vatmvalidation) at
>> scheduler-44ed1457-a752-4037-89b6-590221db3de5@192.168.151.84:33148
>> I0303 00:16:01.976877 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0077 (:izanami) at
>> scheduler-c648708f-32f3-44d5-9014-3fd0dbb461f7@192.168.151.186:35345
>> I0303 00:16:01.978590 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0083 (vatmoutputg) at
>> scheduler-fb965e89-5764-4a07-a94a-43de45babc7a@192.168.151.84:39218
>>
>> We have close to twice this many frameworks running at the moment, one of
>> them (not included above) with more than 300 tasks waiting and just 100 CPUs
>> assigned (1 CPU per task).
>>
>> The problem, we think, is that the mesos-master does not offer resources to
>> all the tasks all the time, and the declined resources are not re-offered to
>> other tasks. Any idea how to change this behavior, or the rate at which
>> resources are offered to the tasks?
>>
>> FYI We set the --offer_timeout=1sec
>>
>> Thanks in advance.
>>
>> Harold Molina-Bulla Ph.D.
>> On 02/03/2017 23:28, Benjamin Mahler wrote:
>>
>>
>> Ben
>>
>> On Thu, Mar 2, 2017 at 9:00 AM, Harold Molina-Bulla wrote:
>>
>>> Hi Everybody,
>>>
>>> We are trying to develop a scheduler in Python to distribute processes
>>> in a Mesos cluster.
>>>
>>> I have close to 800 CPUs, but the system does not assign all the
>>> available resources to all our tasks.
>>>
>>> To test, we define 1 CPU and 1 GByte of RAM per process so that all the
>>> processes fit on our machines, and launch several scripts simultaneously
>>> so that Nprocs > Ncpus (close to 900 tasks in total).
>>>
>>> Our script is based on the test_framework.py example included in the
>>> Mesos src distribution, with changes such as sending a decline message if
>>> the list of tasks to launch is empty.
>>>
>>> We have deployed Mesos 1.1.0.
>>>
>>> Any ideas on how to improve the use of our resources?
>>>
>>> Thx in advance!
>>> Harold Molina-Bulla Ph.D.
>>> --
>>>
>>> *"En una época de mentira universal, decir la verdad constituye un acto
>>> revolucionario”*
>>> George Orwell (1984)
>>>
>>> Recuerda: PRISM te está vigilando!!! X)
>>> *Harold Molina-Bulla*

Re: Mesos does not assign all available resources

2017-03-02 Thread Gabriel Hartmann
Possibly the suppress/revive problem.

On Thu, Mar 2, 2017 at 4:30 PM Benjamin Mahler  wrote:

> Can you upload the full logs somewhere and link to them here?
>
> How many frameworks are you running? Do they all run in the "*" role?
> Are the tasks short lived or long lived?
> Can you update your test to not use the --offer_timeout? The intention of
> that is to mitigate against frameworks that hold on to offers, but it
> sounds like your frameworks decline.
>
> On Thu, Mar 2, 2017 at 3:57 PM, Harold Molina-Bulla 
> wrote:
>
> Hi,
>
> Thanks for your reply.
>
> Hi there, more clarification is needed:
>
> I have close to 800 CPUs, but the system does not assign all the available
> resources to all our tasks.
>
> What do you mean precisely here? Can you describe what you're seeing?
> Also, you have more than 800GB of RAM, right?
>
>
> Yes, we have at least 2GBytes per CPU, and typically our resource table
> looks like:
>
> In this case 346/788 CPUs are available and not assigned to any task, but
> we have more than 400 tasks waiting to run.
>
> Checking the mesos-master log, it does not make offers to all running
> frameworks all the time, just to a few of them:
>
> I0303 00:16:01.964318 31791 master.cpp:6517] Sending 3 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0053 (Ejecucion: FRUS) at
> scheduler-52a267e9-30d1-4cc8-847e-fa7acfddf855@192.168.151.147:32899
> I0303 00:16:01.966234 31791 master.cpp:6517] Sending 5 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0072 (:izanami) at
> scheduler-ce746b8b-adac-4a0c-8310-5d312c9ed04f@192.168.151.186:44233
> I0303 00:16:01.968003 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0084 (vatmoutput) at
> scheduler-078b1978-840a-437e-a23e-5bca8c5e05c8@192.168.151.84:43023
> I0303 00:16:01.969828 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0081 (vatmoutput) at
> scheduler-d921e4bb-ee23-4e77-93d9-7742264839e5@192.168.151.84:43067
> I0303 00:16:01.971613 31791 master.cpp:6517] Sending 6 offers to framework
> c5299003-e29d-43cb-8ca7-887ab24c8513-0175 (:izanami) at
> scheduler-e10a1167-62d7-4ded-b932-792b5478ab61@192.168.151.186:38706
> I0303 00:16:01.973351 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0082 (vatmoutputg) at
> scheduler-c4db35be-41e1-45cb-8005-f0f7827a23d0@192.168.151.84:33668
> I0303 00:16:01.975126 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0062 (vatmvalidation) at
> scheduler-44ed1457-a752-4037-89b6-590221db3de5@192.168.151.84:33148
> I0303 00:16:01.976877 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0077 (:izanami) at
> scheduler-c648708f-32f3-44d5-9014-3fd0dbb461f7@192.168.151.186:35345
> I0303 00:16:01.978590 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0083 (vatmoutputg) at
> scheduler-fb965e89-5764-4a07-a94a-43de45babc7a@192.168.151.84:39218
>
> We have close to twice this many frameworks running at the moment, one of
> them (not included above) with more than 300 tasks waiting and just 100 CPUs
> assigned (1 CPU per task).
>
> The problem, we think, is that the mesos-master does not offer resources to
> all the tasks all the time, and the declined resources are not re-offered to
> other tasks. Any idea how to change this behavior, or the rate at which
> resources are offered to the tasks?
>
> FYI We set the --offer_timeout=1sec
>
> Thanks in advance.
>
> Harold Molina-Bulla Ph.D.
> On 02/03/2017 23:28, Benjamin Mahler wrote:
>
>
> Ben
>
> On Thu, Mar 2, 2017 at 9:00 AM, Harold Molina-Bulla 
> wrote:
>
> Hi Everybody,
>
> We are trying to develop a scheduler in Python to distribute processes in
> a Mesos cluster.
>
> I have close to 800 CPUs, but the system does not assign all the available
> resources to all our tasks.
>
> To test, we define 1 CPU and 1 GByte of RAM per process so that all the
> processes fit on our machines, and launch several scripts simultaneously
> so that Nprocs > Ncpus (close to 900 tasks in total).
>
> Our script is based on the test_framework.py example included in the Mesos
> src distribution, with changes such as sending a decline message if the
> list of tasks to launch is empty.
>
> We have deployed Mesos 1.1.0.
>
> Any ideas on how to improve the use of our resources?
>
> Thx in advance!
> Harold Molina-Bulla Ph.D.
> --
>
> *"En una época de mentira universal, decir la verdad constituye un acto
> revolucionario”*
> George Orwell (1984)
>
> Recuerda: PRISM te está vigilando!!! X)
> *Harold Molina-Bulla*
> Clave GnuPG: *189D5144*
>
>
>
> --
>
> *"En una época de mentira universal, decir la verdad constituye un acto
> revolucionario”*
> George Orwell (1984)
>
> Recuerda: PRISM te está vigilando!!! X)
> *Harold Molina-Bulla*
> *h.mol...@tsc.uc3m.es *
> 

Re: Mesos does not assign all available resources

2017-03-02 Thread Benjamin Mahler
Can you upload the full logs somewhere and link to them here?

How many frameworks are you running? Do they all run in the "*" role?
Are the tasks short lived or long lived?
Can you update your test to not use the --offer_timeout? The intention of
that is to mitigate against frameworks that hold on to offers, but it
sounds like your frameworks decline.

On Thu, Mar 2, 2017 at 3:57 PM, Harold Molina-Bulla 
wrote:

> Hi,
>
> Thanks for your reply.
>
> Hi there, more clarification is needed:
>
>> I have close to 800 CPUs, but the system does not assign all the
>> available resources to all our tasks.
>>
> What do you mean precisely here? Can you describe what you're seeing?
> Also, you have more than 800GB of RAM, right?
>
>
> Yes, we have at least 2GBytes per CPU, and typically our resource table
> looks like:
>
> In this case 346/788 CPUs are available and not assigned to any task, but
> we have more than 400 tasks waiting to run.
>
> Checking the mesos-master log, it does not make offers to all running
> frameworks all the time, just to a few of them:
>
> I0303 00:16:01.964318 31791 master.cpp:6517] Sending 3 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0053 (Ejecucion: FRUS) at
> scheduler-52a267e9-30d1-4cc8-847e-fa7acfddf855@192.168.151.147:32899
> I0303 00:16:01.966234 31791 master.cpp:6517] Sending 5 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0072 (:izanami) at
> scheduler-ce746b8b-adac-4a0c-8310-5d312c9ed04f@192.168.151.186:44233
> I0303 00:16:01.968003 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0084 (vatmoutput) at
> scheduler-078b1978-840a-437e-a23e-5bca8c5e05c8@192.168.151.84:43023
> I0303 00:16:01.969828 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0081 (vatmoutput) at
> scheduler-d921e4bb-ee23-4e77-93d9-7742264839e5@192.168.151.84:43067
> I0303 00:16:01.971613 31791 master.cpp:6517] Sending 6 offers to framework
> c5299003-e29d-43cb-8ca7-887ab24c8513-0175 (:izanami) at
> scheduler-e10a1167-62d7-4ded-b932-792b5478ab61@192.168.151.186:38706
> I0303 00:16:01.973351 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0082 (vatmoutputg) at
> scheduler-c4db35be-41e1-45cb-8005-f0f7827a23d0@192.168.151.84:33668
> I0303 00:16:01.975126 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0062 (vatmvalidation) at
> scheduler-44ed1457-a752-4037-89b6-590221db3de5@192.168.151.84:33148
> I0303 00:16:01.976877 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0077 (:izanami) at
> scheduler-c648708f-32f3-44d5-9014-3fd0dbb461f7@192.168.151.186:35345
> I0303 00:16:01.978590 31791 master.cpp:6517] Sending 6 offers to framework
> 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0083 (vatmoutputg) at
> scheduler-fb965e89-5764-4a07-a94a-43de45babc7a@192.168.151.84:39218
>
> We have close to twice this many frameworks running at the moment, one of
> them (not included above) with more than 300 tasks waiting and just 100 CPUs
> assigned (1 CPU per task).
>
> The problem, we think, is that the mesos-master does not offer resources to
> all the tasks all the time, and the declined resources are not re-offered to
> other tasks. Any idea how to change this behavior, or the rate at which
> resources are offered to the tasks?
>
> FYI We set the --offer_timeout=1sec
>
> Thanks in advance.
>
> Harold Molina-Bulla Ph.D.
> On 02/03/2017 23:28, Benjamin Mahler wrote:
>
>
> Ben
>
> On Thu, Mar 2, 2017 at 9:00 AM, Harold Molina-Bulla 
> wrote:
>
>> Hi Everybody,
>>
>> We are trying to develop a scheduler in Python to distribute processes
>> in a Mesos cluster.
>>
>> I have close to 800 CPUs, but the system does not assign all the
>> available resources to all our tasks.
>>
>> To test, we define 1 CPU and 1 GByte of RAM per process so that all the
>> processes fit on our machines, and launch several scripts simultaneously
>> so that Nprocs > Ncpus (close to 900 tasks in total).
>>
>> Our script is based on the test_framework.py example included in the
>> Mesos src distribution, with changes such as sending a decline message if
>> the list of tasks to launch is empty.
>>
>> We have deployed Mesos 1.1.0.
>>
>> Any ideas on how to improve the use of our resources?
>>
>> Thx in advance!
>> Harold Molina-Bulla Ph.D.
>> --
>>
>> *"En una época de mentira universal, decir la verdad constituye un acto
>> revolucionario”*
>> George Orwell (1984)
>>
>> Recuerda: PRISM te está vigilando!!! X)
>> *Harold Molina-Bulla*
>> Clave GnuPG: *189D5144*
>>
>
>
> --
>
> *"En una época de mentira universal, decir la verdad constituye un acto
> revolucionario”*
> George Orwell (1984)
>
> Recuerda: PRISM te está vigilando!!! X)
> *Harold Molina-Bulla*
> *h.mol...@tsc.uc3m.es *
> Clave GnuPG: *189D5144*
>


Re: Understanding Mesos Maintenance

2017-03-02 Thread Benjamin Mahler
Hey Zameer, great questions. Let us know if there's anything you think
could be improved or documented better.

Re 1:

The 'Viewing maintenance status' section of the documentation should
clarify this:
http://mesos.apache.org/documentation/latest/maintenance/
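
For example, a small operator-side sketch that reads that status (the
/maintenance/status endpoint comes from the maintenance documentation above;
the response field names used here, draining_machines / statuses /
framework_id / status, are my reading of it and should be verified against
your Mesos version):

import json
import urllib.request

MASTER = "http://mesos-master.example.com:5050"   # placeholder address

def print_inverse_offer_statuses():
    resp = urllib.request.urlopen(MASTER + "/maintenance/status")
    status = json.loads(resp.read().decode("utf-8"))

    for machine in status.get("draining_machines", []):
        machine_id = machine.get("id", {})
        print("machine:", machine_id.get("hostname"), machine_id.get("ip"))
        for s in machine.get("statuses", []):
            # One entry per framework that has responded to its inverse offer.
            print("  framework", s.get("framework_id", {}).get("value"),
                  "->", s.get("status"))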

Re 2:

Both of these sound reasonable, but the scheduler should not accept the
maintenance if it's not yet safe for the machine to be downed. Otherwise a
task failure may be mistakenly interpreted as a go-ahead to down the
machine, despite the scheduler needing to get the task running again. If
expensive or long-running work needs to finish (e.g. migrating data,
replacing instances in a manner that doesn't violate the SLA, etc.), then I
would suggest waiting until the work completes safely before accepting.

We likely need a third state, like TENTATIVELY_ACCEPT, to signal to
operators / Mesos that the framework intends to comply but hasn't yet
finished whatever it needs to do for it to be safe to down the machine.

Also, one of the challenges here is when to take the action. Should the
scheduler prepare itself for maintenance as soon as it safely can? Or as
late (but not too late!) as it safely can? If the scheduler runs
long-running services, as soon as safely possible makes sense. If the
scheduler runs short-running batch jobs, as late as safely possible
provides work-conservation.

Re 3:

The framework will receive another inverse offer if it still has
resources allocated on that agent. If it receives a regular offer for
available resources on the agent, an 'Unavailability' [1] will be included
if the machine is scheduled for maintenance, so that the scheduler can be
aware of the maintenance when placing new work.
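
As a sketch of how a scheduler might use that field when placing new work
(field names follow the Unavailability message linked in [1]; the 30-minute
horizon is just an example threshold):

import time

MAINTENANCE_HORIZON_SECS = 30 * 60   # example threshold, tune per workload

def safe_for_long_running_work(offer):
    # Offers for agents scheduled for maintenance carry an Unavailability.
    if not offer.HasField("unavailability"):
        return True
    start_secs = offer.unavailability.start.nanoseconds / 1e9
    # Only place long-running work if the unavailability is far enough away.
    return (start_secs - time.time()) > MAINTENANCE_HORIZON_SECS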

Re 4:

It's not currently possible, and it's the operator's responsibility (the
intention was for "operator" to be maintenance tooling). Ideally we can add
automation of this decision into Mesos, if widely applicable decision
criteria can be established (e.g. if nothing is running and all relevant
frameworks have accepted). Feel free to file a ticket for this or any other
improvements!
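
For completeness, the operator/tooling step looks roughly like the following
(the /machine/down endpoint and MachineID payload are taken from the
maintenance documentation; the master address is a placeholder and any
authentication your cluster requires is omitted):

import json
import urllib.request

MASTER = "http://mesos-master.example.com:5050"   # placeholder address

def mark_machine_down(hostname, ip):
    # Only call this once you have verified the machine is free of tasks.
    payload = json.dumps([{"hostname": hostname, "ip": ip}]).encode("utf-8")
    req = urllib.request.Request(
        MASTER + "/machine/down",
        data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)   # raises urllib.error.HTTPError on failure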

Ben

[1] https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426

On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji  wrote:

> Hey,
>
> I'm trying to understand some nuances of the maintenance API. Here are my
> questions:
>
> 1. The documentation mentions that accepting or declining an inverse
> offer is a "hint" to the operator. How do operators view whether a framework
> has declined, accepted, or ignored an inverse offer?
>
> 2. Should a framework accept an inverse offer and then start removing
> tasks from an agent or should the framework only accept the inverse offer
> after the removal of tasks is complete? I think the former makes sense, but
> it implies that operators need to poll the state of the agent to ensure
> there are no active tasks whereas the latter implies operators only need to
> check if all inverse offers were accepted.
>
> 3. After accepting the inverse offer, will a framework get another inverse
> offer for the same agent? Currently I'm trying to determine if inverse
> offer information needs to be persisted so a framework can continue it's
> draining work between failovers or if it can just wait for an inverse offer
> after starting up.
>
> 4. Is it possible for the agent to automatically transition from DRAIN to
> DOWN if at the start of the unavailability period the agent is free of
> tasks or is that still the operator's responsibility?
>
> --
> Zameer Manji
>


Re: Mesos does not assign all available resources

2017-03-02 Thread Harold Molina-Bulla
Hi,

Thanks for your reply.

> Hi there, more clarification is needed: 
>
> I have close to 800 CPUs, but the system does not assign all the
> available resources to all our tasks.
>
> What do you mean precisely here? Can you describe what you're seeing?
> Also, you have more than 800GB of RAM, right?
>

Yes, we have at least 2GBytes per CPU, and typically our resource table
looks like:

In this case 346/788 CPUs are available and not assigned to any task,
but we have more than 400 tasks waiting to run.

Checking the mesos-master log, it does not make offers to all running
frameworks all the time, just to a few of them:

> I0303 00:16:01.964318 31791 master.cpp:6517] Sending 3 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0053 (Ejecucion: FRUS)
> at scheduler-52a267e9-30d1-4cc8-847e-fa7acfddf855@192.168.151.147:32899
> I0303 00:16:01.966234 31791 master.cpp:6517] Sending 5 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0072 (:izanami) at
> scheduler-ce746b8b-adac-4a0c-8310-5d312c9ed04f@192.168.151.186:44233
> I0303 00:16:01.968003 31791 master.cpp:6517] Sending 6 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0084 (vatmoutput) at
> scheduler-078b1978-840a-437e-a23e-5bca8c5e05c8@192.168.151.84:43023
> I0303 00:16:01.969828 31791 master.cpp:6517] Sending 6 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0081 (vatmoutput) at
> scheduler-d921e4bb-ee23-4e77-93d9-7742264839e5@192.168.151.84:43067
> I0303 00:16:01.971613 31791 master.cpp:6517] Sending 6 offers to
> framework c5299003-e29d-43cb-8ca7-887ab24c8513-0175 (:izanami) at
> scheduler-e10a1167-62d7-4ded-b932-792b5478ab61@192.168.151.186:38706
> I0303 00:16:01.973351 31791 master.cpp:6517] Sending 6 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0082 (vatmoutputg) at
> scheduler-c4db35be-41e1-45cb-8005-f0f7827a23d0@192.168.151.84:33668
> I0303 00:16:01.975126 31791 master.cpp:6517] Sending 6 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0062 (vatmvalidation)
> at scheduler-44ed1457-a752-4037-89b6-590221db3de5@192.168.151.84:33148
> I0303 00:16:01.976877 31791 master.cpp:6517] Sending 6 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0077 (:izanami) at
> scheduler-c648708f-32f3-44d5-9014-3fd0dbb461f7@192.168.151.186:35345
> I0303 00:16:01.978590 31791 master.cpp:6517] Sending 6 offers to
> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0083 (vatmoutputg) at
> scheduler-fb965e89-5764-4a07-a94a-43de45babc7a@192.168.151.84:39218
We have close to twice this many frameworks running at the moment, one of
them (not included above) with more than 300 tasks waiting and just 100
CPUs assigned (1 CPU per task).

The problem, we think, is that the mesos-master does not offer resources
to all the tasks all the time, and the declined resources are not
re-offered to other tasks. Any idea how to change this behavior, or the
rate at which resources are offered to the tasks?

FYI We set the --offer_timeout=1sec

Thanks in advance.

Harold Molina-Bulla Ph.D.

On 02/03/2017 23:28, Benjamin Mahler wrote:
>
> Ben
>
> On Thu, Mar 2, 2017 at 9:00 AM, Harold Molina-Bulla wrote:
>
> Hi Everybody,
>
> We are trying to develop a scheduler in Python to distribute
> processes in a Mesos cluster.
>
> I have close to 800 CPUs, but the system does not assign all the
> available resources to all our tasks.
>
> To test, we define 1 CPU and 1 GByte of RAM per process so that
> all the processes fit on our machines, and launch several scripts
> simultaneously so that Nprocs > Ncpus (close to 900 tasks in total).
>
> Our script is based on the test_framework.py example included in
> the Mesos src distribution, with changes such as sending a decline
> message if the list of tasks to launch is empty.
>
> We have deployed Mesos 1.1.0.
>
> Any ideas on how to improve the use of our resources?
>
> Thx in advance!
>
> Harold Molina-Bulla Ph.D.
> -- 
>
> /"En una época de mentira universal, decir la verdad constituye un
> acto revolucionario”/
> George Orwell (1984)
>
> Recuerda: PRISM te está vigilando!!! X)
>
> *Harold Molina-Bulla*
> Clave GnuPG: *189D5144*
>
>

-- 

/"En una época de mentira universal, decir la verdad constituye un acto
revolucionario”/
George Orwell (1984)

Recuerda: PRISM te está vigilando!!! X)

*Harold Molina-Bulla*
/h.mol...@tsc.uc3m.es/
Clave GnuPG: *189D5144*




Re: Mesos does not assign all available resources

2017-03-02 Thread Benjamin Mahler
Hi there, more clarification is needed:

> I have close to 800 CPUs, but the system does not assign all the available
> resources to all our tasks.
>
What do you mean precisely here? Can you describe what you're seeing?
Also, you have more than 800GB of RAM, right?

Ben

On Thu, Mar 2, 2017 at 9:00 AM, Harold Molina-Bulla 
wrote:

> Hi Everybody,
>
> We are trying to develop a scheduler in Python to distribute processes in
> a Mesos cluster.
>
> I have close to 800 CPUs, but the system does not assign all the available
> resources to all our tasks.
>
> To test, we define 1 CPU and 1 GByte of RAM per process so that all the
> processes fit on our machines, and launch several scripts simultaneously
> so that Nprocs > Ncpus (close to 900 tasks in total).
>
> Our script is based on the test_framework.py example included in the Mesos
> src distribution, with changes such as sending a decline message if the
> list of tasks to launch is empty.
>
> We have deployed Mesos 1.1.0.
>
> Any ideas on how to improve the use of our resources?
>
> Thx in advance!
> Harold Molina-Bulla Ph.D.
> --
>
> *"En una época de mentira universal, decir la verdad constituye un acto
> revolucionario”*
> George Orwell (1984)
>
> Recuerda: PRISM te está vigilando!!! X)
> *Harold Molina-Bulla*
> Clave GnuPG: *189D5144*
>


Re: Welcome Kevin Klues as a Mesos Committer and PMC member!

2017-03-02 Thread Yan Xu
Congrats Kevin!

---
Jiang Yan Xu  | @xujyan 

On Wed, Mar 1, 2017 at 2:05 PM, Benjamin Mahler  wrote:

> Hi all,
>
> Please welcome Kevin Klues as the newest committer and PMC member of the
> Apache Mesos project.
>
> Kevin has been an active contributor to the project for over a year, and in
> this time he has made a number of contributions: Nvidia GPU
> support [1], the containerization side of POD support (new container init
> process), and support for "attach" and "exec" of commands within running
> containers [2].
>
> Also, Kevin took on an effort with Haris Choudhary to revive the CLI [3]
> via a better structured python implementation (to be more accessible to
> contributors) and a more extensible architecture to better support adding
> new or custom subcommands. The work also adds a unit test framework for the
> CLI functionality (we had no tests previously!). I think it's great that
> Kevin took on this much needed improvement with Haris, and I'm very much
> looking forward to seeing this land in the project.
>
> Here is his committer eligibility document for perusal:
> https://docs.google.com/document/d/1mlO1yyLCoCSd85XeDKIxTYyboK_uiOJ4Uwr6ruKTlFM/edit
>
> Thanks!
> Ben
>
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2]
> https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU
> [3]
> https://docs.google.com/document/d/1r6Iv4Efu8v8IBrcUTjgYkvZ32WVscgYqrD07OyIglsA/
>


Mesos does not assign all available resources

2017-03-02 Thread Harold Molina-Bulla
Hi Everybody,

We are trying to develop a scheduler in Python to distribute processes
in a Mesos cluster.

I have close to 800 CPUs, but the system does not assign all the
available resources to all our tasks.

To test, we define 1 CPU and 1 GByte of RAM per process so that all the
processes fit on our machines, and launch several scripts simultaneously
so that Nprocs > Ncpus (close to 900 tasks in total).

Our script is based on the test_framework.py example included in the
Mesos src distribution, with changes such as sending a decline message
if the list of tasks to launch is empty.

We have deployed Mesos 1.1.0.

Any ideas on how to improve the use of our resources?

Thx in advance!

Harold Molina-Bulla Ph.D.
-- 

/"En una época de mentira universal, decir la verdad constituye un acto
revolucionario”/
George Orwell (1984)

Recuerda: PRISM te está vigilando!!! X)

*Harold Molina-Bulla*
Clave GnuPG: *189D5144*




Re: [VOTE] Release Apache Mesos 1.2.0 (rc2)

2017-03-02 Thread Adam Bordelon
TL;DR: No consensus yet. Let's extend the vote for a day or two, until we
have 3 +1s or a legit -1.
During that time we can test further, and investigate any issues that have
shown up.

Here's a summary of what's been reported on the 1.2.0-rc2 vote thread:

- There was a perf core dump on ASF CI, which is not necessarily a blocker:
MESOS-7160  Parsing of perf version segfaults
  Perhaps fixed by backporting MESOS-6982: PerfTest.Version fails on recent
Arch Linux

- There were a couple of (known/unsurprising) flaky tests:
MESOS-7185
DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_DockerDefaultEntryptRegistryPuller
is flaky
MESOS-4570  DockerFetcherPluginTest.INTERNET_CURL_FetchImage seems flaky.

- If we were to have an rc3, the following Critical bugs could be included:
MESOS-7050  IOSwitchboard FDs leaked when containerizer launch fails --
leads to deadlock
MESOS-6982  PerfTest.Version fails on recent Arch Linux

- Plus doc updates:
MESOS-7188 Add documentation for Debug APIs to Operator API doc
MESOS-7189 Add nested container launch/wait/kill APIs to agent API docs.


On Wed, Mar 1, 2017 at 11:30 AM, Neil Conway  wrote:

> The perf core dump might be addressed if we backport this change:
>
> https://reviews.apache.org/r/56611/
>
> Although my guess is that this isn't a severe problem: for some
> as-yet-unknown reason, running `perf` on the host segfaulted, which
> causes the test to fail.
>
> Neil
>
> On Wed, Mar 1, 2017 at 11:09 AM, Vinod Kone  wrote:
> > Tested on ASF CI.
> >
> > Saw 2 configurations fail. One was the perf core dump issue. The other is
> > a known (since 0.28.0) flaky test with the Docker fetcher plugin.
> >
> > Withholding the vote until we know the severity of the perf core dump.
> >
> >
> > *Revision*: b9d8202a7444d0d1e49476bfc9817eb4583beaff
> >
> >- refs/tags/1.1.1-rc2
> >
> > Configuration Matrix (all builds with GLOG_v=1 MESOS_VERBOSE=1; links to
> > the ASF CI jobs omitted):
> >
> > centos:7      --verbose --enable-libevent --enable-ssl  autotools  gcc: Success  clang: Not run
> >                                                         cmake      gcc: Success  clang: Not run
> >               --verbose                                 autotools  gcc: Success  clang: Not run
> >                                                         cmake      gcc: Success  clang: Not run
> > ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools  gcc: Success  clang: Failed
> >                                                         cmake      gcc: Success  clang: Success
> >               --verbose                                 autotools  gcc: Success