What Gabriel is alluding to is a situation where you have:

* Frameworks with lower shares that do not want additional resources, and
* Frameworks with higher shares that do want additional resources.

If there are a sufficient number of frameworks, it's possible for the
decline filters of the low share frameworks to expire before we get a
chance to offer resources to the high share frameworks. In this case, we
are stuck offering to the low share frameworks and never get a chance to
offer to the high share frameworks.
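
For reference, the filter duration here is whatever refuse_seconds the
scheduler passes when it declines; the default is only 5 seconds. A minimal
sketch with the old Python bindings (the helper name is just illustrative):

    from mesos.interface import mesos_pb2

    def decline_with_filter(driver, offer, seconds=300.0):
        # Decline and ask the master not to re-offer these resources to
        # this framework for `seconds` (the default filter is only 5 secs).
        filters = mesos_pb2.Filters()
        filters.refuse_seconds = seconds
        driver.declineOffer(offer.id, filters)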

I can't tell yet if this is what is occurring in your setup, but the
recommendation is to update the scheduler to make a SUPPRESS call to tell
Mesos it does not want any more resources (and to REVIVE later when it
does). In your case, that means that once the task list is empty, you
should send a SUPPRESS call.
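
For example, against the old Python bindings that test_framework.py uses,
a minimal sketch could look like the following (it assumes your driver
exposes suppressOffers(), and the task-queue attribute names are just
illustrative):

    from mesos.interface import Scheduler

    class MyScheduler(Scheduler):
        def __init__(self, tasks):
            self.pending_tasks = list(tasks)  # tasks waiting to be launched
            self.suppressed = False

        def resourceOffers(self, driver, offers):
            if self.pending_tasks:
                # Launch tasks against these offers as test_framework.py does.
                return
            # Task list is empty: decline what we were just offered and
            # SUPPRESS so the master stops sending us offers entirely.
            for offer in offers:
                driver.declineOffer(offer.id)
            if not self.suppressed:
                driver.suppressOffers()
                self.suppressed = True

        def add_tasks(self, driver, tasks):
            # New work arrived: start receiving offers again.
            self.pending_tasks.extend(tasks)
            if self.suppressed:
                driver.reviveOffers()
                self.suppressed = False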

Ben



On Thu, Mar 2, 2017 at 4:33 PM, Gabriel Hartmann <gabr...@mesosphere.io>
wrote:

> Possibly the suppress/revive problem.
>
> On Thu, Mar 2, 2017 at 4:30 PM Benjamin Mahler <bmah...@apache.org> wrote:
>
>> Can you upload the full logs somewhere and link to them here?
>>
>> How many frameworks are you running? Do they all run in the "*" role?
>> Are the tasks short lived or long lived?
>> Can you update your test to not use the --offer_timeout? The intention of
>> that is to mitigate against frameworks that hold on to offers, but it
>> sounds like your frameworks decline.
>>
>> On Thu, Mar 2, 2017 at 3:57 PM, Harold Molina-Bulla <h.mol...@tsc.uc3m.es
>> > wrote:
>>
>> Hi,
>>
>> Thanks for your reply.
>>
>> Hi there, more clarification is needed:
>>
>> I have close to 800 CPUs, but the system does not assign all the
>> available resources to all our tasks.
>>
>> What do you mean precisely here? Can you describe what you're seeing?
>> Also, you have more than 800GB of RAM, right?
>>
>>
>> Yes, we have at least 2 GBytes per CPU. In a typical snapshot of our
>> resource table, 346 of 788 CPUs are available and not assigned to any
>> task, but we have more than 400 tasks waiting to run.
>>
>> Checking the mesos-master log, it does not make offers to all running
>> frameworks all the time, just to a few of them:
>>
>> I0303 00:16:01.964318 31791 master.cpp:6517] Sending 3 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0053 (Ejecucion: FRUS) at
>> scheduler-52a267e9-30d1-4cc8-847e-fa7acfddf855@192.168.151.147:32899
>> I0303 00:16:01.966234 31791 master.cpp:6517] Sending 5 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0072 (:izanami) at
>> scheduler-ce746b8b-adac-4a0c-8310-5d312c9ed04f@192.168.151.186:44233
>> I0303 00:16:01.968003 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0084 (vatmoutput) at
>> scheduler-078b1978-840a-437e-a23e-5bca8c5e05c8@192.168.151.84:43023
>> I0303 00:16:01.969828 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0081 (vatmoutput) at
>> scheduler-d921e4bb-ee23-4e77-93d9-7742264839e5@192.168.151.84:43067
>> I0303 00:16:01.971613 31791 master.cpp:6517] Sending 6 offers to
>> framework c5299003-e29d-43cb-8ca7-887ab24c8513-0175 (:izanami) at
>> scheduler-e10a1167-62d7-4ded-b932-792b5478ab61@192.168.151.186:38706
>> I0303 00:16:01.973351 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0082 (vatmoutputg) at
>> scheduler-c4db35be-41e1-45cb-8005-f0f7827a23d0@192.168.151.84:33668
>> I0303 00:16:01.975126 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0062 (vatmvalidation) at
>> scheduler-44ed1457-a752-4037-89b6-590221db3de5@192.168.151.84:33148
>> I0303 00:16:01.976877 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0077 (:izanami) at
>> scheduler-c648708f-32f3-44d5-9014-3fd0dbb461f7@192.168.151.186:35345
>> I0303 00:16:01.978590 31791 master.cpp:6517] Sending 6 offers to
>> framework 4d896f23-1ce0-46d6-ae0f-acbe23f2a38c-0083 (vatmoutputg) at
>> scheduler-fb965e89-5764-4a07-a94a-43de45babc7a@192.168.151.84:39218
>>
>> We have close to twice as many frameworks as listed above running at this
>> moment; one of them (not included in the excerpt) has more than 300 tasks
>> waiting and just 100 CPUs assigned (1 CPU per task).
>>
>> The problem, we think, is that the mesos-master does not offer resources
>> to all the frameworks all the time, and the declined resources are not
>> re-offered to other frameworks. Any idea how to change this behavior, or
>> the rate at which resources are offered?
>>
>> FYI We set the --offer_timeout=1sec
>>
>> Thanks in advance.
>>
>> Harold Molina-Bulla Ph.D.
>> On 02/03/2017 23:28, Benjamin Mahler wrote:
>>
>>
>> Ben
>>
>> On Thu, Mar 2, 2017 at 9:00 AM, Harold Molina-Bulla <h.mol...@tsc.uc3m.es
>> > wrote:
>>
>> Hi Everybody,
>>
>> We are trying to develop a scheduler in Python to distribute processes
>> in a Mesos cluster.
>>
>> I have close to 800 CPUs, but the system does not assign all the
>> available resources to all our tasks.
>>
>> For testing we define 1 CPU and 1 GByte of RAM per process, so that all
>> the processes fit on our machines, and we launch several scripts
>> simultaneously so that Nprocs > Ncpus (close to 900 tasks in total).
>>
>> Our script is based on the test_framework.py example included in the
>> Mesos source distribution, with changes such as declining offers when the
>> list of tasks to launch is empty.
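>>
>> A simplified sketch of the relevant parts (illustrative only, not our
>> exact code; each task asks for 1 CPU and 1 GByte of RAM):
>>
>>     import mesos.interface
>>     from mesos.interface import mesos_pb2
>>
>>     TASK_CPUS = 1
>>     TASK_MEM = 1024  # MB, i.e. 1 GByte per task
>>
>>     class OurScheduler(mesos.interface.Scheduler):
>>         def __init__(self, executor, task_ids):
>>             self.executor = executor
>>             self.tasks_to_launch = list(task_ids)
>>
>>         def resourceOffers(self, driver, offers):
>>             for offer in offers:
>>                 if not self.tasks_to_launch:
>>                     # Task list is empty: decline instead of holding it.
>>                     driver.declineOffer(offer.id)
>>                     continue
>>                 # One task per offer, for brevity.
>>                 task_id = self.tasks_to_launch.pop()
>>                 driver.launchTasks(offer.id,
>>                                    [self.new_task(offer, task_id)])
>>
>>         def new_task(self, offer, task_id):
>>             # 1 CPU + 1 GByte of RAM per task, as in test_framework.py.
>>             task = mesos_pb2.TaskInfo()
>>             task.task_id.value = str(task_id)
>>             task.slave_id.value = offer.slave_id.value
>>             task.name = "task %s" % task_id
>>             task.executor.MergeFrom(self.executor)
>>
>>             cpus = task.resources.add()
>>             cpus.name = "cpus"
>>             cpus.type = mesos_pb2.Value.SCALAR
>>             cpus.scalar.value = TASK_CPUS
>>
>>             mem = task.resources.add()
>>             mem.name = "mem"
>>             mem.type = mesos_pb2.Value.SCALAR
>>             mem.scalar.value = TASK_MEM
>>             return task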
>>
>> We have deployed Mesos 1.1.0.
>>
>> Any ideas on how to improve the use of our resources?
>>
>> Thx in advance!
>> Harold Molina-Bulla Ph.D.
>> --
>>
>> *"En una época de mentira universal, decir la verdad constituye un acto
>> revolucionario”*
>> George Orwell (1984)
>>
>> Recuerda: PRISM te está vigilando!!! X)
>> *Harold Molina-Bulla*
>> Clave GnuPG: *189D5144*
>>
>>
>>
>> --
>>
>> *"En una época de mentira universal, decir la verdad constituye un acto
>> revolucionario”*
>> George Orwell (1984)
>>
>> Recuerda: PRISM te está vigilando!!! X)
>> *Harold Molina-Bulla*
>> *h.mol...@tsc.uc3m.es <h.mol...@tsc.uc3m.es>*
>> Clave GnuPG: *189D5144*
>>
>>
>>
