Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-22 Thread Elizabeth Lingg
I see. This is good to know for current development work. Thanks for 
clarifying, Guangya and Kevin.

 Elizabeth Lingg


> On Jun 22, 2016, at 3:02 AM, Guangya Liu  wrote:
> 
> Hi Elizabeth,
> 
> Just FYI, there is a JIRA tracing the resource revocation here
> https://issues.apache.org/jira/browse/MESOS-4967
> 
> And I'm also working on the short term solution of excluding the scarce
> resources from allocator (https://reviews.apache.org/r/48906/), with this
> feature and Kevin's GPU_RESOURCES capability, the mesos can handle scarce
> resources well.
> 
> Thanks,
> 
> Guangya
> 
> On Wed, Jun 22, 2016 at 4:45 AM, Kevin Klues  wrote:
> 
>> As an FYI, preliminary support to work around this issue for GPUs will
>> appear in the 1.0 release
>> https://reviews.apache.org/r/48914/
>> 
>> This doesn't solve the problem of scarce resources in general, but it
>> will at least keep non-GPU workloads from starving out GPU-based
>> workloads on GPU capable machines. The downside of this approach is
>> that only GPU aware frameworks will be able to launch stuff on GPU
>> capable machines (meaning some of their resources could go unused
>> unnecessarily).  We decided this tradeoff is acceptable for now.
>> 
>> Kevin
>> 
>> On Tue, Jun 21, 2016 at 1:40 PM, Elizabeth Lingg
>>  wrote:
>>> Thanks, looking forward to discussion and review on your document. The
>> main use case I see here is that some of our frameworks will want to
>> request the GPU resources, and we want to make sure that those frameworks
>> are able to successfully launch tasks on agents with those resources. We
>> want to be certain that other frameworks that do not require GPU’s will not
>> request all other resources on those agents (i.e. cpu, disk, memory) which
>> would mean the GPU resources are not allocated and the frameworks that
>> require them will not receive them. As Ben Mahler mentioned, "(2) Because
>> we do not have revocation yet, if a framework decides to consume the
>> non-GPU resources on a GPU machine, it will prevent the GPU workloads from
>> running!” This will occur for us in clusters where we have higher
>> utilization as well as different types of workloads running. Smart task
>> placement then becomes more relevant (i.e. we want to be able to schedule
>> with scarce resources successfully and we may have considerations like not
>> scheduling too many I/O bound workloads on a single host or more stringent
>> requirements for scheduling persistent tasks).
>>> 
>>>  Elizabeth Lingg
>>> 
>>> 
>>> 
 On Jun 20, 2016, at 7:24 PM, Guangya Liu  wrote:
 
 Had some discussion with Ben M, for the following two solutions:
 
 1) Ben M: Create sub-pools of resources based on machine profile and
 perform fair sharing / quota within each pool plus a framework
 capability GPU_AWARE
 to enable allocator filter out scarce resources for some frameworks.
 2) Guangya: Adding new sorters for non scarce resources plus a framework
 capability GPU_AWARE to enable allocator filter out scarce resources for
 some frameworks.
 
 Both of the above two solutions are meaning same thing and there is no
 difference between those two solutions: Create sub-pools of resources
>> will
 need to introduce different sorters for each sub-pools, so I will merge
 those two solutions to one.
 
 Also had some dicsussion with Ben for AlexR's solution of implementing
 "requestResource", this API should be treated as an improvement to the
 issues of doing resource allocation pessimistically. (e.g. we
>> offer/decline
 the GPUs to 1000 frameworks before offering it to the GPU framework that
 wants it). And the "requestResource" is providing *more information* to
 mesos. Namely, it gives us awareness of demand.
 
 Even though for some cases, we can use the "requestResource" to get all
>> of
 the scarce resources, and then once those scarce resources are in use,
>> then
 the WDRF sorter will sorter non scarce resources as normal, but the
>> problem
 is that we cannot guarantee that the framework which have
>> "requestResource"
 can always consume all of the scarce resources before those scarce
>> resource
 allocated to other frameworks.
 
 I'm planning to draft a document based on solution 1) "Create sub-pools"
 for the long term solution, any comments are welcome!
 
 Thanks,
 
 Guangya
 
 On Sat, Jun 18, 2016 at 11:58 AM, Guangya Liu 
>> wrote:
 
> Thanks Du Fan. So you mean that we should have some clear rules in
> document or somewhere else to tell or guide cluster admin which
>> resources
> should be classified as scarce resources, right?
> 
> On Sat, Jun 18, 2016 at 2:38 AM, Du, Fan  wrote:
> 
>> 
>> 
>> On 2016/6/17 7:57, Guangya Liu wrote:
>> 
>>> 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-22 Thread Guangya Liu
Hi Elizabeth,

Just FYI, there is a JIRA tracking resource revocation here:
https://issues.apache.org/jira/browse/MESOS-4967

I'm also working on the short-term solution of excluding scarce resources
from the allocator (https://reviews.apache.org/r/48906/). With this feature
and Kevin's GPU_RESOURCES capability, Mesos can handle scarce resources well.

Thanks,

Guangya

On Wed, Jun 22, 2016 at 4:45 AM, Kevin Klues  wrote:

> As an FYI, preliminary support to work around this issue for GPUs will
> appear in the 1.0 release
> https://reviews.apache.org/r/48914/
>
> This doesn't solve the problem of scarce resources in general, but it
> will at least keep non-GPU workloads from starving out GPU-based
> workloads on GPU capable machines. The downside of this approach is
> that only GPU aware frameworks will be able to launch stuff on GPU
> capable machines (meaning some of their resources could go unused
> unnecessarily).  We decided this tradeoff is acceptable for now.
>
> Kevin
>
> On Tue, Jun 21, 2016 at 1:40 PM, Elizabeth Lingg
>  wrote:
> > Thanks, looking forward to discussion and review on your document. The
> main use case I see here is that some of our frameworks will want to
> request the GPU resources, and we want to make sure that those frameworks
> are able to successfully launch tasks on agents with those resources. We
> want to be certain that other frameworks that do not require GPU’s will not
> request all other resources on those agents (i.e. cpu, disk, memory) which
> would mean the GPU resources are not allocated and the frameworks that
> require them will not receive them. As Ben Mahler mentioned, "(2) Because
> we do not have revocation yet, if a framework decides to consume the
> non-GPU resources on a GPU machine, it will prevent the GPU workloads from
> running!” This will occur for us in clusters where we have higher
> utilization as well as different types of workloads running. Smart task
> placement then becomes more relevant (i.e. we want to be able to schedule
> with scarce resources successfully and we may have considerations like not
> scheduling too many I/O bound workloads on a single host or more stringent
> requirements for scheduling persistent tasks).
> >
> >  Elizabeth Lingg
> >
> >
> >
> >> On Jun 20, 2016, at 7:24 PM, Guangya Liu  wrote:
> >>
> >> Had some discussion with Ben M, for the following two solutions:
> >>
> >> 1) Ben M: Create sub-pools of resources based on machine profile and
> >> perform fair sharing / quota within each pool plus a framework
> >> capability GPU_AWARE
> >> to enable allocator filter out scarce resources for some frameworks.
> >> 2) Guangya: Adding new sorters for non scarce resources plus a framework
> >> capability GPU_AWARE to enable allocator filter out scarce resources for
> >> some frameworks.
> >>
> >> Both of the above two solutions are meaning same thing and there is no
> >> difference between those two solutions: Create sub-pools of resources
> will
> >> need to introduce different sorters for each sub-pools, so I will merge
> >> those two solutions to one.
> >>
> >> Also had some dicsussion with Ben for AlexR's solution of implementing
> >> "requestResource", this API should be treated as an improvement to the
> >> issues of doing resource allocation pessimistically. (e.g. we
> offer/decline
> >> the GPUs to 1000 frameworks before offering it to the GPU framework that
> >> wants it). And the "requestResource" is providing *more information* to
> >> mesos. Namely, it gives us awareness of demand.
> >>
> >> Even though for some cases, we can use the "requestResource" to get all
> of
> >> the scarce resources, and then once those scarce resources are in use,
> then
> >> the WDRF sorter will sorter non scarce resources as normal, but the
> problem
> >> is that we cannot guarantee that the framework which have
> "requestResource"
> >> can always consume all of the scarce resources before those scarce
> resource
> >> allocated to other frameworks.
> >>
> >> I'm planning to draft a document based on solution 1) "Create sub-pools"
> >> for the long term solution, any comments are welcome!
> >>
> >> Thanks,
> >>
> >> Guangya
> >>
> >> On Sat, Jun 18, 2016 at 11:58 AM, Guangya Liu 
> wrote:
> >>
> >>> Thanks Du Fan. So you mean that we should have some clear rules in
> >>> document or somewhere else to tell or guide cluster admin which
> resources
> >>> should be classified as scarce resources, right?
> >>>
> >>> On Sat, Jun 18, 2016 at 2:38 AM, Du, Fan  wrote:
> >>>
> 
> 
>  On 2016/6/17 7:57, Guangya Liu wrote:
> 
> > @Fan Du,
> >
> > Currently, I think that the scarce resources should be defined by
> cluster
> > admin, s/he can specify those scarce resources via a flag when master
> > start
> > up.
> >
> 
>  This is not what I mean.
>  IMO, it's not cluster admin's 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-21 Thread Kevin Klues
As an FYI, preliminary support to work around this issue for GPUs will
appear in the 1.0 release
https://reviews.apache.org/r/48914/

This doesn't solve the problem of scarce resources in general, but it
will at least keep non-GPU workloads from starving out GPU-based
workloads on GPU capable machines. The downside of this approach is
that only GPU aware frameworks will be able to launch stuff on GPU
capable machines (meaning some of their resources could go unused
unnecessarily).  We decided this tradeoff is acceptable for now.
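
For intuition, a hedged sketch of the filtering rule this implies (the actual
change is in the C++ allocator; see the review above): agents exposing GPUs
are only offered to frameworks that advertise the GPU_RESOURCES capability.

# Hedged sketch of the offer filter: GPU agents go only to GPU-aware frameworks.
def eligible(framework_capabilities, agent_resources):
    has_gpus = agent_resources.get("gpus", 0) > 0
    return (not has_gpus) or ("GPU_RESOURCES" in framework_capabilities)

gpu_agent = {"gpus": 1, "cpus": 4, "mem": 1024}
plain_agent = {"cpus": 4, "mem": 1024}

print(eligible({"GPU_RESOURCES"}, gpu_agent))  # True: GPU-aware framework gets offers
print(eligible(set(), gpu_agent))              # False: legacy framework is skipped
print(eligible(set(), plain_agent))            # True: non-GPU agents are unaffected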

Kevin

On Tue, Jun 21, 2016 at 1:40 PM, Elizabeth Lingg
 wrote:
> Thanks, looking forward to discussion and review on your document. The main 
> use case I see here is that some of our frameworks will want to request the 
> GPU resources, and we want to make sure that those frameworks are able to 
> successfully launch tasks on agents with those resources. We want to be 
> certain that other frameworks that do not require GPU’s will not request all 
> other resources on those agents (i.e. cpu, disk, memory) which would mean the 
> GPU resources are not allocated and the frameworks that require them will not 
> receive them. As Ben Mahler mentioned, "(2) Because we do not have revocation 
> yet, if a framework decides to consume the non-GPU resources on a GPU 
> machine, it will prevent the GPU workloads from running!” This will occur for 
> us in clusters where we have higher utilization as well as different types of 
> workloads running. Smart task placement then becomes more relevant (i.e. we 
> want to be able to schedule with scarce resources successfully and we may 
> have considerations like not scheduling too many I/O bound workloads on a 
> single host or more stringent requirements for scheduling persistent tasks).
>
>  Elizabeth Lingg
>
>
>
>> On Jun 20, 2016, at 7:24 PM, Guangya Liu  wrote:
>>
>> Had some discussion with Ben M, for the following two solutions:
>>
>> 1) Ben M: Create sub-pools of resources based on machine profile and
>> perform fair sharing / quota within each pool plus a framework
>> capability GPU_AWARE
>> to enable allocator filter out scarce resources for some frameworks.
>> 2) Guangya: Adding new sorters for non scarce resources plus a framework
>> capability GPU_AWARE to enable allocator filter out scarce resources for
>> some frameworks.
>>
>> Both of the above two solutions are meaning same thing and there is no
>> difference between those two solutions: Create sub-pools of resources will
>> need to introduce different sorters for each sub-pools, so I will merge
>> those two solutions to one.
>>
>> Also had some dicsussion with Ben for AlexR's solution of implementing
>> "requestResource", this API should be treated as an improvement to the
>> issues of doing resource allocation pessimistically. (e.g. we offer/decline
>> the GPUs to 1000 frameworks before offering it to the GPU framework that
>> wants it). And the "requestResource" is providing *more information* to
>> mesos. Namely, it gives us awareness of demand.
>>
>> Even though for some cases, we can use the "requestResource" to get all of
>> the scarce resources, and then once those scarce resources are in use, then
>> the WDRF sorter will sorter non scarce resources as normal, but the problem
>> is that we cannot guarantee that the framework which have "requestResource"
>> can always consume all of the scarce resources before those scarce resource
>> allocated to other frameworks.
>>
>> I'm planning to draft a document based on solution 1) "Create sub-pools"
>> for the long term solution, any comments are welcome!
>>
>> Thanks,
>>
>> Guangya
>>
>> On Sat, Jun 18, 2016 at 11:58 AM, Guangya Liu  wrote:
>>
>>> Thanks Du Fan. So you mean that we should have some clear rules in
>>> document or somewhere else to tell or guide cluster admin which resources
>>> should be classified as scarce resources, right?
>>>
>>> On Sat, Jun 18, 2016 at 2:38 AM, Du, Fan  wrote:
>>>


 On 2016/6/17 7:57, Guangya Liu wrote:

> @Fan Du,
>
> Currently, I think that the scarce resources should be defined by cluster
> admin, s/he can specify those scarce resources via a flag when master
> start
> up.
>

 This is not what I mean.
 IMO, it's not cluster admin's call to decide what resources should be
 marked as scarce , they can carry out the operation, but should be advised
 on based on the clear rule: to what extend the resource is scarce compared
 with other resources, and it will affect wDRF by causing starvation for
 frameworks which holds scarce resources, that's my point.

 To my best knowledge here, a quantitative study of how wDRF behaves in
 scenario of one/multiple scarce resources first will help to verify the
 proposed approach, and guide the user of this functionality.



 Regarding to the proposal of generic 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-21 Thread Elizabeth Lingg
Thanks, looking forward to discussion and review on your document. The main use 
case I see here is that some of our frameworks will want to request the GPU 
resources, and we want to make sure that those frameworks are able to 
successfully launch tasks on agents with those resources. We want to be certain 
that other frameworks that do not require GPUs will not request all other 
resources on those agents (i.e. cpu, disk, memory) which would mean the GPU 
resources are not allocated and the frameworks that require them will not 
receive them. As Ben Mahler mentioned, "(2) Because we do not have revocation 
yet, if a framework decides to consume the non-GPU resources on a GPU machine, 
it will prevent the GPU workloads from running!” This will occur for us in 
clusters where we have higher utilization as well as different types of 
workloads running. Smart task placement then becomes more relevant (i.e. we 
want to be able to schedule with scarce resources successfully and we may have 
considerations like not scheduling too many I/O bound workloads on a single 
host or more stringent requirements for scheduling persistent tasks).

 Elizabeth Lingg



> On Jun 20, 2016, at 7:24 PM, Guangya Liu  wrote:
> 
> Had some discussion with Ben M, for the following two solutions:
> 
> 1) Ben M: Create sub-pools of resources based on machine profile and
> perform fair sharing / quota within each pool plus a framework
> capability GPU_AWARE
> to enable allocator filter out scarce resources for some frameworks.
> 2) Guangya: Adding new sorters for non scarce resources plus a framework
> capability GPU_AWARE to enable allocator filter out scarce resources for
> some frameworks.
> 
> Both of the above two solutions are meaning same thing and there is no
> difference between those two solutions: Create sub-pools of resources will
> need to introduce different sorters for each sub-pools, so I will merge
> those two solutions to one.
> 
> Also had some dicsussion with Ben for AlexR's solution of implementing
> "requestResource", this API should be treated as an improvement to the
> issues of doing resource allocation pessimistically. (e.g. we offer/decline
> the GPUs to 1000 frameworks before offering it to the GPU framework that
> wants it). And the "requestResource" is providing *more information* to
> mesos. Namely, it gives us awareness of demand.
> 
> Even though for some cases, we can use the "requestResource" to get all of
> the scarce resources, and then once those scarce resources are in use, then
> the WDRF sorter will sorter non scarce resources as normal, but the problem
> is that we cannot guarantee that the framework which have "requestResource"
> can always consume all of the scarce resources before those scarce resource
> allocated to other frameworks.
> 
> I'm planning to draft a document based on solution 1) "Create sub-pools"
> for the long term solution, any comments are welcome!
> 
> Thanks,
> 
> Guangya
> 
> On Sat, Jun 18, 2016 at 11:58 AM, Guangya Liu  wrote:
> 
>> Thanks Du Fan. So you mean that we should have some clear rules in
>> document or somewhere else to tell or guide cluster admin which resources
>> should be classified as scarce resources, right?
>> 
>> On Sat, Jun 18, 2016 at 2:38 AM, Du, Fan  wrote:
>> 
>>> 
>>> 
>>> On 2016/6/17 7:57, Guangya Liu wrote:
>>> 
 @Fan Du,
 
 Currently, I think that the scarce resources should be defined by cluster
 admin, s/he can specify those scarce resources via a flag when master
 start
 up.
 
>>> 
>>> This is not what I mean.
>>> IMO, it's not cluster admin's call to decide what resources should be
>>> marked as scarce , they can carry out the operation, but should be advised
>>> on based on the clear rule: to what extend the resource is scarce compared
>>> with other resources, and it will affect wDRF by causing starvation for
>>> frameworks which holds scarce resources, that's my point.
>>> 
>>> To my best knowledge here, a quantitative study of how wDRF behaves in
>>> scenario of one/multiple scarce resources first will help to verify the
>>> proposed approach, and guide the user of this functionality.
>>> 
>>> 
>>> 
>>> Regarding to the proposal of generic scarce resources, do you have any
 thoughts on this? I can see that giving framework developers the options
 of
 define scarce resources may bring trouble to mesos, it is better to let
 mesos define those scarce but not framework developer.
 
>>> 
>> 



Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-20 Thread Guangya Liu
Had some discussion with Ben M about the following two solutions:

1) Ben M: Create sub-pools of resources based on machine profile and
perform fair sharing / quota within each pool, plus a framework capability
GPU_AWARE to enable the allocator to filter out scarce resources for some
frameworks.
2) Guangya: Add new sorters for non-scarce resources, plus a framework
capability GPU_AWARE to enable the allocator to filter out scarce resources
for some frameworks.

Both of the above solutions mean the same thing, and there is no real
difference between them: creating sub-pools of resources will require
introducing a different sorter for each sub-pool, so I will merge the two
solutions into one.

I also had some discussion with Ben about AlexR's solution of implementing
"requestResources". This API should be treated as an improvement over doing
resource allocation pessimistically (e.g. offering/declining the GPUs to 1000
frameworks before offering them to the GPU framework that wants them).
"requestResources" provides *more information* to Mesos; namely, it gives us
awareness of demand.

Even though in some cases we could use "requestResources" to get all of the
scarce resources, and once those scarce resources are in use the wDRF sorter
would sort non-scarce resources as normal, the problem is that we cannot
guarantee that the framework which issued "requestResources" will always
consume all of the scarce resources before they are allocated to other
frameworks.

I'm planning to draft a document based on solution 1) "Create sub-pools" as
the long-term solution; any comments are welcome!

Thanks,

Guangya

On Sat, Jun 18, 2016 at 11:58 AM, Guangya Liu  wrote:

> Thanks Du Fan. So you mean that we should have some clear rules in
> document or somewhere else to tell or guide cluster admin which resources
> should be classified as scarce resources, right?
>
> On Sat, Jun 18, 2016 at 2:38 AM, Du, Fan  wrote:
>
>>
>>
>> On 2016/6/17 7:57, Guangya Liu wrote:
>>
>>> @Fan Du,
>>>
>>> Currently, I think that the scarce resources should be defined by cluster
>>> admin, s/he can specify those scarce resources via a flag when master
>>> start
>>> up.
>>>
>>
>> This is not what I mean.
>> IMO, it's not cluster admin's call to decide what resources should be
>> marked as scarce , they can carry out the operation, but should be advised
>> on based on the clear rule: to what extend the resource is scarce compared
>> with other resources, and it will affect wDRF by causing starvation for
>> frameworks which holds scarce resources, that's my point.
>>
>> To my best knowledge here, a quantitative study of how wDRF behaves in
>> scenario of one/multiple scarce resources first will help to verify the
>> proposed approach, and guide the user of this functionality.
>>
>>
>>
>> Regarding to the proposal of generic scarce resources, do you have any
>>> thoughts on this? I can see that giving framework developers the options
>>> of
>>> define scarce resources may bring trouble to mesos, it is better to let
>>> mesos define those scarce but not framework developer.
>>>
>>
>


Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-17 Thread Guangya Liu
Thanks Du Fan. So you mean that we should have some clear rules, in a
document or somewhere else, to guide the cluster admin on which resources
should be classified as scarce resources, right?

On Sat, Jun 18, 2016 at 2:38 AM, Du, Fan  wrote:

>
>
> On 2016/6/17 7:57, Guangya Liu wrote:
>
>> @Fan Du,
>>
>> Currently, I think that the scarce resources should be defined by cluster
>> admin, s/he can specify those scarce resources via a flag when master
>> start
>> up.
>>
>
> This is not what I mean.
> IMO, it's not cluster admin's call to decide what resources should be
> marked as scarce , they can carry out the operation, but should be advised
> on based on the clear rule: to what extend the resource is scarce compared
> with other resources, and it will affect wDRF by causing starvation for
> frameworks which holds scarce resources, that's my point.
>
> To my best knowledge here, a quantitative study of how wDRF behaves in
> scenario of one/multiple scarce resources first will help to verify the
> proposed approach, and guide the user of this functionality.
>
>
>
> Regarding to the proposal of generic scarce resources, do you have any
>> thoughts on this? I can see that giving framework developers the options
>> of
>> define scarce resources may bring trouble to mesos, it is better to let
>> mesos define those scarce but not framework developer.
>>
>


Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-17 Thread Du, Fan



On 2016/6/17 7:57, Guangya Liu wrote:

@Fan Du,

Currently, I think that the scarce resources should be defined by cluster
admin, s/he can specify those scarce resources via a flag when master start
up.


This is not what I mean.
IMO, it's not the cluster admin's call to decide what resources should be
marked as scarce. They can carry out the operation, but they should be
advised based on a clear rule: to what extent the resource is scarce compared
with other resources, and how it will affect wDRF by causing starvation for
frameworks which hold scarce resources. That's my point.


To the best of my knowledge, a quantitative study of how wDRF behaves in
scenarios with one or multiple scarce resources would first help to verify
the proposed approach and guide users of this functionality.
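
For example, one simple quantitative signal an operator could be given
(purely illustrative; no such rule exists today) is the fraction of agents
that carry the resource at all:

# Illustrative only: one possible scarcity signal -- the fraction of agents
# carrying a given resource. A very small fraction hints at wDRF distortion.
def scarcity(agents, resource):
    carrying = sum(1 for agent in agents if agent.get(resource, 0) > 0)
    return carrying / len(agents)

agents = [{"cpus": 4, "mem": 1024}] * 999 + [{"gpus": 1, "cpus": 4, "mem": 1024}]
print(scarcity(agents, "gpus"))  # 0.001 -> present on 0.1% of the agents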




Regarding to the proposal of generic scarce resources, do you have any
thoughts on this? I can see that giving framework developers the options of
define scarce resources may bring trouble to mesos, it is better to let
mesos define those scarce but not framework developer.


Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Guangya Liu
Thanks all for the input here!

@Hans van den Bogert,

Yes, I agree with Alex R. Mesos currently uses a coarse-grained mode to
allocate resources and the minimum unit is a single host, so you will always
get CPU and memory.

@Alex,

Yes, I was only listing sorters here. Ideally, I think the allocation
sequence should be:

1) Allocate quota non-scarce resources
2) Allocate quota scarce resources
3) Allocate reserved non-scarce resources
4) Allocate reserved scarce resources
5) Allocate revocable non-scarce resources
6) Allocate revocable scarce resources
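
To make the ordering concrete, here is a purely hypothetical sketch of a
staged allocation pass over six per-stage sorters (toy stand-ins only; the
real allocator logic is far more involved):

# Hypothetical sketch: walk six allocation stages in order, each with its own sorter.
STAGES = [
    ("quota", "non-scarce"), ("quota", "scarce"),
    ("reserved", "non-scarce"), ("reserved", "scarce"),
    ("revocable", "non-scarce"), ("revocable", "scarce"),
]

def allocate(sorters):
    for stage in STAGES:
        for role in sorters.get(stage, []):  # roles in fair-share order for this stage
            print("stage", stage, "-> offer matching resources to", role)

# Toy sorters: plain lists standing in for per-stage DRF sorters.
allocate({("quota", "non-scarce"): ["roleA"],
          ("revocable", "scarce"): ["roleB"]})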

Regarding to "requestResources", I think that even we implement it, the
scarce resources will still impact the WDRF sorter as Ben M pointed out in
his use cases.

An ideal solution would be "exclude scarce resources from sorter" plus
"requestResources" for scarce resources. The "exclude scarce resources from
sorter" will focus on non scarce resources while "requestResources" focus
on scarce resources.

I can see that, until now, we have three solutions for handling scarce
resources:
1) Ben M: Create sub-pools of resources based on machine profile and
perform fair sharing / quota within each pool, plus a framework capability
GPU_AWARE to enable the allocator to filter out scarce resources for some
frameworks.
2) Guangya: Add new sorters for non-scarce resources, plus a framework
capability GPU_AWARE to enable the allocator to filter out scarce resources
for some frameworks.
3) Alex R: "requestResources" for scarce resources plus "exclude scarce
resources from sorter" for non-scarce resources (@Alex R, I added "exclude
scarce resources from sorter" to your proposal, hope that is OK?)

Solution 1) may cause low resource utilization, as Ben M pointed out. Both 2)
and 3) still keep resources in a single pool, so resource utilization will
not be impacted.

Between 2) and 3), I have no strong opinion about which one is better. For
2), my only concern is whether many sorters could cause a performance issue,
but since we can assume there are not many scarce resources in the cluster,
the performance impact should be small even if we add another three sorters
for scarce resources.

For solution 3), the only problem with "requestResources" is that it may lead
to a "greedy framework" consuming all resources; we may need to consider
letting "requestResources" request only scarce resources at first, so as to
reduce the impact of greedy frameworks.

Another problem for solutions 1) and 2) is that we need to introduce a
framework capability for each specified scarce resource, so that the
allocator can filter out that resource whenever a new scarce resource type
appears. But I think this will not matter much, as we should not have too
many scarce resource types in the future, precisely because they are "scarce
resources".

@Fan Du,

Currently, I think that the scarce resources should be defined by the cluster
admin; s/he can specify those scarce resources via a flag when the master
starts up.

Regarding the proposal of generic scarce resources, do you have any thoughts
on this? I can see that giving framework developers the option to define
scarce resources may bring trouble to Mesos; it is better to let Mesos define
what is scarce rather than the framework developer.

Thanks,

Guangya


On Fri, Jun 17, 2016 at 6:53 AM, Joris Van Remoortere 
wrote:

> @Fan,
>
> In the community meeting a question was raised around which frameworks
> might be ready to use this.
> Can you provide some more context for immediate use cases on the framework
> side?
>
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Jun 17, 2016 at 12:51 AM, Du, Fan  wrote:
>
> > A couple of rough thoughts in the early morning:
> >
> > a. Is there any quantitative way to decide a resource is kind of scare? I
> > mean how to aid operator to make this decision to use/not use this
> > functionality when deploying mesos.
> >
> > b. Scare resource extend from GPU to, name a few, Xeon Phi, FPGA, what
> > about make the proposal more generic and future proof?
> >
> >
> >
> > On 2016/6/11 10:50, Benjamin Mahler wrote:
> >
> >> I wanted to start a discussion about the allocation of "scarce"
> resources.
> >> "Scarce" in this context means resources that are not present on every
> >> machine. GPUs are the first example of a scarce resource that we support
> >> as
> >> a known resource type.
> >>
> >> Consider the behavior when there are the following agents in a cluster:
> >>
> >> 999 agents with (cpus:4,mem:1024,disk:1024)
> >> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> >>
> >> Here there are 1000 machines but only 1 has GPUs. We call GPUs a
> "scarce"
> >> resource here because they are only present on a small percentage of the
> >> machines.
> >>
> >> We end up with some problematic behavior here with our current
> allocation
> >> model:
> >>
> >>  (1) If a role wishes to use both GPU and non-GPU resources for
> tasks,
> >> consuming 1 GPU will lead DRF to consider the role to have a 100% share
> of
> >> 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Joris Van Remoortere
@Fan,

In the community meeting a question was raised around which frameworks
might be ready to use this.
Can you provide some more context for immediate use cases on the framework
side?


—
*Joris Van Remoortere*
Mesosphere

On Fri, Jun 17, 2016 at 12:51 AM, Du, Fan  wrote:

> A couple of rough thoughts in the early morning:
>
> a. Is there any quantitative way to decide a resource is kind of scare? I
> mean how to aid operator to make this decision to use/not use this
> functionality when deploying mesos.
>
> b. Scare resource extend from GPU to, name a few, Xeon Phi, FPGA, what
> about make the proposal more generic and future proof?
>
>
>
> On 2016/6/11 10:50, Benjamin Mahler wrote:
>
>> I wanted to start a discussion about the allocation of "scarce" resources.
>> "Scarce" in this context means resources that are not present on every
>> machine. GPUs are the first example of a scarce resource that we support
>> as
>> a known resource type.
>>
>> Consider the behavior when there are the following agents in a cluster:
>>
>> 999 agents with (cpus:4,mem:1024,disk:1024)
>> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
>>
>> Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
>> resource here because they are only present on a small percentage of the
>> machines.
>>
>> We end up with some problematic behavior here with our current allocation
>> model:
>>
>>  (1) If a role wishes to use both GPU and non-GPU resources for tasks,
>> consuming 1 GPU will lead DRF to consider the role to have a 100% share of
>> the cluster, since it consumes 100% of the GPUs in the cluster. This
>> framework will then not receive any other offers.
>>
>>  (2) Because we do not have revocation yet, if a framework decides to
>> consume the non-GPU resources on a GPU machine, it will prevent the GPU
>> workloads from running!
>>
>> 
>>
>> I filed an epic [1] to track this. The plan for the short-term is to
>> introduce two mechanisms to mitigate these issues:
>>
>>  -Introduce a resource fairness exclusion list. This allows the shares
>> of resources like "gpus" to be excluded from the dominant share.
>>
>>  -Introduce a GPU_AWARE framework capability. This indicates that the
>> scheduler is aware of GPUs and will schedule tasks accordingly. Old
>> schedulers will not have the capability and will not receive any offers
>> for
>> GPU machines. If a scheduler has the capability, we'll advise that they
>> avoid placing their additional non-GPU workloads on the GPU machines.
>>
>> 
>>
>> Longer term, we'll want a more robust way to manage scarce resources. The
>> first thought we had was to have sub-pools of resources based on machine
>> profile and perform fair sharing / quota within each pool. This addresses
>> (1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
>> frameworks from participating in the GPU pool.
>>
>> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
>> have a lower level of utilization. In the even longer term, as we add
>> revocation it will be possible to allow a scheduler desiring GPUs to
>> revoke
>> the resources allocated to the non-GPU workloads running on the GPU
>> machines. There are a number of things we need to put in place to support
>> revocation ([2], [3], [4], etc), so I'm glossing over the details here.
>>
>> If anyone has any thoughts or insight in this area, please share!
>>
>> Ben
>>
>> [1] https://issues.apache.org/jira/browse/MESOS-5377
>> [2] https://issues.apache.org/jira/browse/MESOS-5524
>> [3] https://issues.apache.org/jira/browse/MESOS-5527
>> [4] https://issues.apache.org/jira/browse/MESOS-4392
>>
>>


Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Du, Fan

A couple of rough thoughts in the early morning:

a. Is there any quantitative way to decide that a resource is scarce? I mean,
how do we help the operator decide whether to use this functionality when
deploying Mesos?


b. Scarce resources extend beyond GPUs to, to name a few, Xeon Phi and FPGAs;
what about making the proposal more generic and future-proof?



On 2016/6/11 10:50, Benjamin Mahler wrote:

I wanted to start a discussion about the allocation of "scarce" resources.
"Scarce" in this context means resources that are not present on every
machine. GPUs are the first example of a scarce resource that we support as
a known resource type.

Consider the behavior when there are the following agents in a cluster:

999 agents with (cpus:4,mem:1024,disk:1024)
1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)

Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
resource here because they are only present on a small percentage of the
machines.

We end up with some problematic behavior here with our current allocation
model:

 (1) If a role wishes to use both GPU and non-GPU resources for tasks,
consuming 1 GPU will lead DRF to consider the role to have a 100% share of
the cluster, since it consumes 100% of the GPUs in the cluster. This
framework will then not receive any other offers.

 (2) Because we do not have revocation yet, if a framework decides to
consume the non-GPU resources on a GPU machine, it will prevent the GPU
workloads from running!
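
A worked example may make (1) concrete. The following is an illustrative
Python sketch of the dominant-share arithmetic only, not the allocator's
actual code; the totals follow the 999 + 1 agent layout above.

# Illustrative sketch: DRF dominant share for the 999 + 1 cluster above.
CLUSTER_TOTAL = {"cpus": 1000 * 4, "mem": 1000 * 1024,
                 "disk": 1000 * 1024, "gpus": 1}

def dominant_share(allocation, total=CLUSTER_TOTAL):
    # Dominant share = max over resource types of allocated / total.
    return max(allocation.get(name, 0) / quantity
               for name, quantity in total.items())

# A role holding the single GPU plus a sliver of cpu/mem/disk:
role = {"cpus": 1, "mem": 128, "disk": 128, "gpus": 1}
print(dominant_share(role))  # 1.0 -- DRF sees this role as using 100% of the cluster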



I filed an epic [1] to track this. The plan for the short-term is to
introduce two mechanisms to mitigate these issues:

 -Introduce a resource fairness exclusion list. This allows the shares
of resources like "gpus" to be excluded from the dominant share.

 -Introduce a GPU_AWARE framework capability. This indicates that the
scheduler is aware of GPUs and will schedule tasks accordingly. Old
schedulers will not have the capability and will not receive any offers for
GPU machines. If a scheduler has the capability, we'll advise that they
avoid placing their additional non-GPU workloads on the GPU machines.
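
As a minimal sketch of what the exclusion-list mechanism means for the share
computation (assuming a hypothetical operator-configured list of excluded
resource names; the real flag and code may differ):

# Minimal sketch: excluded resource names are dropped before computing the share.
EXCLUDED = {"gpus"}  # assumed operator-configured exclusion list

def dominant_share_excluding(allocation, total, excluded=EXCLUDED):
    shares = [allocation.get(name, 0) / quantity
              for name, quantity in total.items()
              if name not in excluded]
    return max(shares) if shares else 0.0

total = {"cpus": 4000, "mem": 1024000, "disk": 1024000, "gpus": 1}
role = {"cpus": 1, "mem": 128, "disk": 128, "gpus": 1}
print(dominant_share_excluding(role, total))  # 0.00025 -- the GPU no longer dominates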



Longer term, we'll want a more robust way to manage scarce resources. The
first thought we had was to have sub-pools of resources based on machine
profile and perform fair sharing / quota within each pool. This addresses
(1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
frameworks from participating in the GPU pool.
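
One way to picture the sub-pool idea, as a rough sketch that assumes the
"machine profile" is simply whether an agent has GPUs (the real pooling
criteria are still open):

# Rough sketch: partition agents into pools by profile; fair sharing / quota
# would then run independently within each pool (one sorter per pool).
agents = ([{"cpus": 4, "mem": 1024, "disk": 1024} for _ in range(999)] +
          [{"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024}])

def pool_of(agent):
    return "gpu-pool" if agent.get("gpus", 0) > 0 else "general-pool"

pools = {}
for agent in agents:
    pools.setdefault(pool_of(agent), []).append(agent)

for name, members in sorted(pools.items()):
    print(name, len(members), "agents")  # general-pool 999, gpu-pool 1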

Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
have a lower level of utilization. In the even longer term, as we add
revocation it will be possible to allow a scheduler desiring GPUs to revoke
the resources allocated to the non-GPU workloads running on the GPU
machines. There are a number of things we need to put in place to support
revocation ([2], [3], [4], etc), so I'm glossing over the details here.

If anyone has any thoughts or insight in this area, please share!

Ben

[1] https://issues.apache.org/jira/browse/MESOS-5377
[2] https://issues.apache.org/jira/browse/MESOS-5524
[3] https://issues.apache.org/jira/browse/MESOS-5527
[4] https://issues.apache.org/jira/browse/MESOS-4392



Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Zhitao Li
+1 for leveraging `requestResources`. I've also toyed with this idea of
allocator groups offline. IMO, giving schedulers a way to specify a resource
envelope size and/or constraints is an easier way to manage the resources.

On Thu, Jun 16, 2016 at 9:39 AM, Alex Rukletsov  wrote:

> We definitely don't want a 2-step scenario. In this case, a framework may
> not be able to launch its tasks on GPU resources, while still holding them.
>
> However, having a dedicated sorter for scarce resources does not mean we
> should allocate them separately. Also, I'm not sure Guangya intended to
> enumerate allocation stages, it looks like he simply listed the sorters. I
> don't see why we may want to allocate scarce resources after allocating
> revocable.
>
> I think a more important question is how to effectively offer scarce
> resources to frameworks that are interested in them. Maybe we can leverage
> `requestResources`?
>
> On Thu, Jun 16, 2016 at 5:22 PM, Hans van den Bogert  >
> wrote:
>
> > Hi all,
> >
> > Maybe I’m missing context info on how something like a GPU as a resource
> > should work, but I assume that the general scenario would be that the GPU
> > host application would still need memory and cpu(s) co-located on the
> node.
> > In the case of,
> > > 4) scarceSorter include 1 agent with (gpus:1)
> >
> > If I understand your meaning correctly, this round of offers would, in
> > this case, only consists of the GPU resource. Is it then up to the
> > framework to figure out it will also need cpu and memory on the agent’s
> > node? If so, it would need at least another offer for that node to make
> the
> > GPU resource useful. Such a 2-step offer/accept is rather cumbersome.
> >
> > Regards,
> >
> > Hans
> >
> >
> >
> > > On Jun 16, 2016, at 11:26 AM, Guangya Liu  wrote:
> > >
> > > Hi Ben,
> > >
> > > The pre-condition for four stage allocation is that we need to put
> > > different resources to different sorters:
> > >
> > > 1) roleSorter only include non scarce resources.
> > > 2) quotaRoleSorter only include non revocable & non scarce resources.
> > > 3) revocableSorter only include revocable & non scarce resources. This
> > will
> > > be handled in MESOS-4923 <
> > https://issues.apache.org/jira/browse/MESOS-4923>
> > > 4) scarceSorter only include scarce resources.
> > >
> > > Take your case above:
> > > 999 agents with (cpus:4,mem:1024,disk:1024)
> > > 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> > >
> > > The four sorters would be:
> > > 1) roleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> > > 2) quotaRoleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> > > 3) revocableSorter include nothing as I have no revocable resources
> here.
> > > 4) scarceSorter include 1 agent with (gpus:1)
> > >
> > > When allocate resources, even if a role got the agent with gpu
> resources,
> > > its share will only be counter by scarceSorter but not other sorters,
> and
> > > will not impact other sorters.
> > >
> > > The above solution is actually kind of enhancement to "exclude scarce
> > > resources" as the scarce resources also obey the DRF algorithm with
> this.
> > >
> > > This solution can be also treated as diving the whole resources pool
> > > logically to scarce and non scarce resource pool. 1), 2) and 3) will
> > handle
> > > non scarce resources while 4) focus on scarce resources.
> > >
> > > Thanks,
> > >
> > > Guangya
> > >
> > > On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler 
> > wrote:
> > >
> > >> Hm.. can you expand on how adding another allocation stage for only
> > scarce
> > >> resources would behave well? It seems to have a number of problems
> when
> > I
> > >> think through it.
> > >>
> > >> On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu 
> > wrote:
> > >>
> > >>> Hi Ben,
> > >>>
> > >>> For long term goal, instead of creating sub-pool, what about adding a
> > new
> > >>> sorter to handle **scare** resources? The current logic in allocator
> > was
> > >>> divided to two stages: allocation for quota, allocation for non quota
> > >>> resources.
> > >>>
> > >>> I think that the future logic in allocator would be divided to four
> > >>> stages:
> > >>> 1) allocation for quota
> > >>> 2) allocation for reserved resources
> > >>> 3) allocation for revocable resources
> > >>> 4) allocation for scare resources
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Guangy
> > >>>
> > >>> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler <
> bmah...@apache.org>
> > >>> wrote:
> > >>>
> >  I wanted to start a discussion about the allocation of "scarce"
> >  resources. "Scarce" in this context means resources that are not
> > present on
> >  every machine. GPUs are the first example of a scarce resource that
> we
> >  support as a known resource type.
> > 
> >  Consider the behavior when there are the following agents in a
> > cluster:
> > 
> >  999 agents with 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Alex Rukletsov
We definitely don't want a 2-step scenario. In this case, a framework may
not be able to launch its tasks on GPU resources, while still holding them.

However, having a dedicated sorter for scarce resources does not mean we
should allocate them separately. Also, I'm not sure Guangya intended to
enumerate allocation stages, it looks like he simply listed the sorters. I
don't see why we may want to allocate scarce resources after allocating
revocable.

I think a more important question is how to effectively offer scarce
resources to frameworks that are interested in them. Maybe we can leverage
`requestResources`?
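
For illustration only, a sketch of how demand conveyed via `requestResources`
could be used to decide which framework sees a scarce resource first; the
allocator-side handling shown here is hypothetical, not existing behavior:

# Illustrative only: prefer frameworks with recorded demand for the scarce
# resource when choosing whom to offer it to first.
demand = {}  # framework name -> resources it asked for via requestResources

def record_request(framework, resources):
    demand[framework] = resources

def offer_order(frameworks, scarce_name):
    # Frameworks that asked for the scarce resource come first.
    return sorted(frameworks,
                  key=lambda f: demand.get(f, {}).get(scarce_name, 0),
                  reverse=True)

record_request("gpu-framework", {"gpus": 1, "cpus": 2, "mem": 2048})
print(offer_order(["batch-framework", "gpu-framework"], "gpus"))
# ['gpu-framework', 'batch-framework']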

On Thu, Jun 16, 2016 at 5:22 PM, Hans van den Bogert 
wrote:

> Hi all,
>
> Maybe I’m missing context info on how something like a GPU as a resource
> should work, but I assume that the general scenario would be that the GPU
> host application would still need memory and cpu(s) co-located on the node.
> In the case of,
> > 4) scarceSorter include 1 agent with (gpus:1)
>
> If I understand your meaning correctly, this round of offers would, in
> this case, only consists of the GPU resource. Is it then up to the
> framework to figure out it will also need cpu and memory on the agent’s
> node? If so, it would need at least another offer for that node to make the
> GPU resource useful. Such a 2-step offer/accept is rather cumbersome.
>
> Regards,
>
> Hans
>
>
>
> > On Jun 16, 2016, at 11:26 AM, Guangya Liu  wrote:
> >
> > Hi Ben,
> >
> > The pre-condition for four stage allocation is that we need to put
> > different resources to different sorters:
> >
> > 1) roleSorter only include non scarce resources.
> > 2) quotaRoleSorter only include non revocable & non scarce resources.
> > 3) revocableSorter only include revocable & non scarce resources. This
> will
> > be handled in MESOS-4923 <
> https://issues.apache.org/jira/browse/MESOS-4923>
> > 4) scarceSorter only include scarce resources.
> >
> > Take your case above:
> > 999 agents with (cpus:4,mem:1024,disk:1024)
> > 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> >
> > The four sorters would be:
> > 1) roleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> > 2) quotaRoleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> > 3) revocableSorter include nothing as I have no revocable resources here.
> > 4) scarceSorter include 1 agent with (gpus:1)
> >
> > When allocate resources, even if a role got the agent with gpu resources,
> > its share will only be counter by scarceSorter but not other sorters, and
> > will not impact other sorters.
> >
> > The above solution is actually kind of enhancement to "exclude scarce
> > resources" as the scarce resources also obey the DRF algorithm with this.
> >
> > This solution can be also treated as diving the whole resources pool
> > logically to scarce and non scarce resource pool. 1), 2) and 3) will
> handle
> > non scarce resources while 4) focus on scarce resources.
> >
> > Thanks,
> >
> > Guangya
> >
> > On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler 
> wrote:
> >
> >> Hm.. can you expand on how adding another allocation stage for only
> scarce
> >> resources would behave well? It seems to have a number of problems when
> I
> >> think through it.
> >>
> >> On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu 
> wrote:
> >>
> >>> Hi Ben,
> >>>
> >>> For long term goal, instead of creating sub-pool, what about adding a
> new
> >>> sorter to handle **scare** resources? The current logic in allocator
> was
> >>> divided to two stages: allocation for quota, allocation for non quota
> >>> resources.
> >>>
> >>> I think that the future logic in allocator would be divided to four
> >>> stages:
> >>> 1) allocation for quota
> >>> 2) allocation for reserved resources
> >>> 3) allocation for revocable resources
> >>> 4) allocation for scare resources
> >>>
> >>> Thanks,
> >>>
> >>> Guangy
> >>>
> >>> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler 
> >>> wrote:
> >>>
>  I wanted to start a discussion about the allocation of "scarce"
>  resources. "Scarce" in this context means resources that are not
> present on
>  every machine. GPUs are the first example of a scarce resource that we
>  support as a known resource type.
> 
>  Consider the behavior when there are the following agents in a
> cluster:
> 
>  999 agents with (cpus:4,mem:1024,disk:1024)
>  1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> 
>  Here there are 1000 machines but only 1 has GPUs. We call GPUs a
>  "scarce" resource here because they are only present on a small
> percentage
>  of the machines.
> 
>  We end up with some problematic behavior here with our current
>  allocation model:
> 
> (1) If a role wishes to use both GPU and non-GPU resources for
>  tasks, consuming 1 GPU will lead DRF to consider the role to have a
> 100%
>  share of the 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Hans van den Bogert
Hi all, 

Maybe I’m missing context info on how something like a GPU as a resource should 
work, but I assume that the general scenario would be that the GPU host 
application would still need memory and cpu(s) co-located on the node.
In the case of,
> 4) scarceSorter include 1 agent with (gpus:1)

If I understand your meaning correctly, this round of offers would, in this
case, only consist of the GPU resource. Is it then up to the framework to
figure out that it will also need CPU and memory on the agent’s node? If so,
it would need at least another offer for that node to make the GPU resource
useful. Such a 2-step offer/accept is rather cumbersome.
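
A toy check makes the concern concrete (illustrative only): an offer
containing only GPUs cannot satisfy a task that also needs CPU and memory on
the same agent.

# Illustrative only: a gpus-only offer cannot satisfy a task that also needs cpu/mem.
def satisfies(offer, task):
    return all(offer.get(name, 0) >= quantity for name, quantity in task.items())

task = {"gpus": 1, "cpus": 1, "mem": 512}
print(satisfies({"gpus": 1}, task))                          # False
print(satisfies({"gpus": 1, "cpus": 4, "mem": 1024}, task))  # True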

Regards,

Hans



> On Jun 16, 2016, at 11:26 AM, Guangya Liu  wrote:
> 
> Hi Ben,
> 
> The pre-condition for four stage allocation is that we need to put
> different resources to different sorters:
> 
> 1) roleSorter only include non scarce resources.
> 2) quotaRoleSorter only include non revocable & non scarce resources.
> 3) revocableSorter only include revocable & non scarce resources. This will
> be handled in MESOS-4923 
> 4) scarceSorter only include scarce resources.
> 
> Take your case above:
> 999 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> 
> The four sorters would be:
> 1) roleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> 2) quotaRoleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> 3) revocableSorter include nothing as I have no revocable resources here.
> 4) scarceSorter include 1 agent with (gpus:1)
> 
> When allocate resources, even if a role got the agent with gpu resources,
> its share will only be counter by scarceSorter but not other sorters, and
> will not impact other sorters.
> 
> The above solution is actually kind of enhancement to "exclude scarce
> resources" as the scarce resources also obey the DRF algorithm with this.
> 
> This solution can be also treated as diving the whole resources pool
> logically to scarce and non scarce resource pool. 1), 2) and 3) will handle
> non scarce resources while 4) focus on scarce resources.
> 
> Thanks,
> 
> Guangya
> 
> On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler  wrote:
> 
>> Hm.. can you expand on how adding another allocation stage for only scarce
>> resources would behave well? It seems to have a number of problems when I
>> think through it.
>> 
>> On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu  wrote:
>> 
>>> Hi Ben,
>>> 
>>> For long term goal, instead of creating sub-pool, what about adding a new
>>> sorter to handle **scare** resources? The current logic in allocator was
>>> divided to two stages: allocation for quota, allocation for non quota
>>> resources.
>>> 
>>> I think that the future logic in allocator would be divided to four
>>> stages:
>>> 1) allocation for quota
>>> 2) allocation for reserved resources
>>> 3) allocation for revocable resources
>>> 4) allocation for scare resources
>>> 
>>> Thanks,
>>> 
>>> Guangy
>>> 
>>> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler 
>>> wrote:
>>> 
 I wanted to start a discussion about the allocation of "scarce"
 resources. "Scarce" in this context means resources that are not present on
 every machine. GPUs are the first example of a scarce resource that we
 support as a known resource type.
 
 Consider the behavior when there are the following agents in a cluster:
 
 999 agents with (cpus:4,mem:1024,disk:1024)
 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
 
 Here there are 1000 machines but only 1 has GPUs. We call GPUs a
 "scarce" resource here because they are only present on a small percentage
 of the machines.
 
 We end up with some problematic behavior here with our current
 allocation model:
 
(1) If a role wishes to use both GPU and non-GPU resources for
 tasks, consuming 1 GPU will lead DRF to consider the role to have a 100%
 share of the cluster, since it consumes 100% of the GPUs in the cluster.
 This framework will then not receive any other offers.
 
(2) Because we do not have revocation yet, if a framework decides to
 consume the non-GPU resources on a GPU machine, it will prevent the GPU
 workloads from running!
 
 
 
 I filed an epic [1] to track this. The plan for the short-term is to
 introduce two mechanisms to mitigate these issues:
 
-Introduce a resource fairness exclusion list. This allows the
 shares of resources like "gpus" to be excluded from the dominant share.
 
-Introduce a GPU_AWARE framework capability. This indicates that the
 scheduler is aware of GPUs and will schedule tasks accordingly. Old
 schedulers will not have the capability and will not receive any offers for
 GPU machines. If a scheduler has the capability, 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Guangya Liu
Thanks Joris; sorry, I forgot the case where scarce resources are also
requested via quota.

But on second thought, not only quota resources but also reserved and
revocable resources can be scarce, so we may need to handle all of those
cases.

I think that in the future, the allocator should allocate resources like this:
1) Allocate resources for quota.
2) Allocate reserved resources.
3) Allocate revocable resources. (After the "revocable by default" project, I
think we will only have reserved resources and revocable resources.)

So based on the above analysis we need three steps to allocate all resources,
but after introducing scarce resources, we need to split each of the above
three kinds of resources into two: one scarce and the other non-scarce.

Then there should be six sorters:
1) quota non-scarce sorter
2) non-scarce reserved sorter
3) non-scarce revocable sorter
4) quota scarce sorter
5) scarce reserved sorter
6) scarce revocable sorter

Since not many hosts have scarce resources, the last three sorters (for
scarce resources) should not impact performance much. Comments?
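
As a rough sketch of the splitting step this implies (assuming the set of
scarce resource names is known to the master), each resource vector would be
divided into a scarce part and a non-scarce part before being handed to the
corresponding sorters:

# Rough sketch: split a resource vector into scarce and non-scarce parts.
SCARCE = {"gpus"}  # assumed set of scarce resource names known to the master

def split(resources, scarce=SCARCE):
    scarce_part = {n: q for n, q in resources.items() if n in scarce}
    plain_part = {n: q for n, q in resources.items() if n not in scarce}
    return scarce_part, plain_part

print(split({"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024}))
# ({'gpus': 1}, {'cpus': 4, 'mem': 1024, 'disk': 1024})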

Thanks,

Guangya

On Thu, Jun 16, 2016 at 7:30 PM, Joris Van Remoortere 
wrote:

> With this 4th sorter approach, how does quota work for scarce resources?
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Thu, Jun 16, 2016 at 11:26 AM, Guangya Liu  wrote:
>
> > Hi Ben,
> >
> > The pre-condition for four stage allocation is that we need to put
> > different resources to different sorters:
> >
> > 1) roleSorter only include non scarce resources.
> > 2) quotaRoleSorter only include non revocable & non scarce resources.
> > 3) revocableSorter only include revocable & non scarce resources. This
> will
> > be handled in MESOS-4923 <
> https://issues.apache.org/jira/browse/MESOS-4923
> > >
> > 4) scarceSorter only include scarce resources.
> >
> > Take your case above:
> > 999 agents with (cpus:4,mem:1024,disk:1024)
> > 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> >
> > The four sorters would be:
> > 1) roleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> > 2) quotaRoleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> > 3) revocableSorter include nothing as I have no revocable resources here.
> > 4) scarceSorter include 1 agent with (gpus:1)
> >
> > When allocate resources, even if a role got the agent with gpu resources,
> > its share will only be counter by scarceSorter but not other sorters, and
> > will not impact other sorters.
> >
> > The above solution is actually kind of enhancement to "exclude scarce
> > resources" as the scarce resources also obey the DRF algorithm with this.
> >
> > This solution can be also treated as diving the whole resources pool
> > logically to scarce and non scarce resource pool. 1), 2) and 3) will
> handle
> > non scarce resources while 4) focus on scarce resources.
> >
> > Thanks,
> >
> > Guangya
> >
> > On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler 
> > wrote:
> >
> > > Hm.. can you expand on how adding another allocation stage for only
> > scarce
> > > resources would behave well? It seems to have a number of problems
> when I
> > > think through it.
> > >
> > > On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu 
> wrote:
> > >
> > >> Hi Ben,
> > >>
> > >> For long term goal, instead of creating sub-pool, what about adding a
> > new
> > >> sorter to handle **scare** resources? The current logic in allocator
> was
> > >> divided to two stages: allocation for quota, allocation for non quota
> > >> resources.
> > >>
> > >> I think that the future logic in allocator would be divided to four
> > >> stages:
> > >> 1) allocation for quota
> > >> 2) allocation for reserved resources
> > >> 3) allocation for revocable resources
> > >> 4) allocation for scare resources
> > >>
> > >> Thanks,
> > >>
> > >> Guangy
> > >>
> > >> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler  >
> > >> wrote:
> > >>
> > >>> I wanted to start a discussion about the allocation of "scarce"
> > >>> resources. "Scarce" in this context means resources that are not
> > present on
> > >>> every machine. GPUs are the first example of a scarce resource that
> we
> > >>> support as a known resource type.
> > >>>
> > >>> Consider the behavior when there are the following agents in a
> cluster:
> > >>>
> > >>> 999 agents with (cpus:4,mem:1024,disk:1024)
> > >>> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> > >>>
> > >>> Here there are 1000 machines but only 1 has GPUs. We call GPUs a
> > >>> "scarce" resource here because they are only present on a small
> > percentage
> > >>> of the machines.
> > >>>
> > >>> We end up with some problematic behavior here with our current
> > >>> allocation model:
> > >>>
> > >>> (1) If a role wishes to use both GPU and non-GPU resources for
> > >>> tasks, consuming 1 GPU will lead DRF to consider the role to have a
> > 100%
> > >>> share of the 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Joris Van Remoortere
With this 4th sorter approach, how does quota work for scarce resources?

—
*Joris Van Remoortere*
Mesosphere

On Thu, Jun 16, 2016 at 11:26 AM, Guangya Liu  wrote:

> Hi Ben,
>
> The pre-condition for four stage allocation is that we need to put
> different resources to different sorters:
>
> 1) roleSorter only include non scarce resources.
> 2) quotaRoleSorter only include non revocable & non scarce resources.
> 3) revocableSorter only include revocable & non scarce resources. This will
> be handled in MESOS-4923 <https://issues.apache.org/jira/browse/MESOS-4923>
> 4) scarceSorter only include scarce resources.
>
> Take your case above:
> 999 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
>
> The four sorters would be:
> 1) roleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> 2) quotaRoleSorter include 1000 agents with (cpus:4,mem:1024,disk:1024)
> 3) revocableSorter include nothing as I have no revocable resources here.
> 4) scarceSorter include 1 agent with (gpus:1)
>
> When allocating resources, even if a role gets the agent with GPU resources,
> its share will only be counted by the scarceSorter and not by the other
> sorters, so it will not impact them.
>
> The above solution is actually a kind of enhancement to "excluding scarce
> resources", since the scarce resources also obey the DRF algorithm under
> this scheme.
>
> This solution can also be viewed as logically dividing the whole resource
> pool into a scarce and a non-scarce pool: 1), 2) and 3) handle non-scarce
> resources, while 4) focuses on scarce resources.
>
> Thanks,
>
> Guangya
>
> On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler 
> wrote:
>
> > Hm.. can you expand on how adding another allocation stage for only
> scarce
> > resources would behave well? It seems to have a number of problems when I
> > think through it.
> >
> > On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu  wrote:
> >
> >> Hi Ben,
> >>
> >> For long term goal, instead of creating sub-pool, what about adding a
> new
> >> sorter to handle **scarce** resources? The current logic in allocator was
> >> divided to two stages: allocation for quota, allocation for non quota
> >> resources.
> >>
> >> I think that the future logic in allocator would be divided to four
> >> stages:
> >> 1) allocation for quota
> >> 2) allocation for reserved resources
> >> 3) allocation for revocable resources
> >> 4) allocation for scarce resources
> >>
> >> Thanks,
> >>
> >> Guangy
> >>
> >> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler 
> >> wrote:
> >>
> >>> I wanted to start a discussion about the allocation of "scarce"
> >>> resources. "Scarce" in this context means resources that are not
> present on
> >>> every machine. GPUs are the first example of a scarce resource that we
> >>> support as a known resource type.
> >>>
> >>> Consider the behavior when there are the following agents in a cluster:
> >>>
> >>> 999 agents with (cpus:4,mem:1024,disk:1024)
> >>> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> >>>
> >>> Here there are 1000 machines but only 1 has GPUs. We call GPUs a
> >>> "scarce" resource here because they are only present on a small
> percentage
> >>> of the machines.
> >>>
> >>> We end up with some problematic behavior here with our current
> >>> allocation model:
> >>>
> >>> (1) If a role wishes to use both GPU and non-GPU resources for
> >>> tasks, consuming 1 GPU will lead DRF to consider the role to have a
> 100%
> >>> share of the cluster, since it consumes 100% of the GPUs in the
> cluster.
> >>> This framework will then not receive any other offers.
> >>>
> >>> (2) Because we do not have revocation yet, if a framework decides
> to
> >>> consume the non-GPU resources on a GPU machine, it will prevent the GPU
> >>> workloads from running!
> >>>
> >>> 
> >>>
> >>> I filed an epic [1] to track this. The plan for the short-term is to
> >>> introduce two mechanisms to mitigate these issues:
> >>>
> >>> -Introduce a resource fairness exclusion list. This allows the
> >>> shares of resources like "gpus" to be excluded from the dominant share.
> >>>
> >>> -Introduce a GPU_AWARE framework capability. This indicates that
> the
> >>> scheduler is aware of GPUs and will schedule tasks accordingly. Old
> >>> schedulers will not have the capability and will not receive any
> offers for
> >>> GPU machines. If a scheduler has the capability, we'll advise that they
> >>> avoid placing their additional non-GPU workloads on the GPU machines.
> >>>
> >>> 
> >>>
> >>> Longer term, we'll want a more robust way to manage scarce resources.
> >>> The first thought we had was to have sub-pools of resources based on
> >>> machine profile and perform fair sharing / quota within each pool. This
> >>> addresses (1) cleanly, and for (2) the operator needs to explicitly
> >>> disallow non-GPU frameworks from participating in the GPU pool.
> >>>
> 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-16 Thread Guangya Liu
Hi Ben,

The precondition for the four-stage allocation is that we need to put
different resources into different sorters:

1) roleSorter only includes non-scarce resources.
2) quotaRoleSorter only includes non-revocable & non-scarce resources.
3) revocableSorter only includes revocable & non-scarce resources. This will
be handled in MESOS-4923 <https://issues.apache.org/jira/browse/MESOS-4923>.
4) scarceSorter only includes scarce resources.

Take your case above:
999 agents with (cpus:4,mem:1024,disk:1024)
1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)

The four sorters would be:
1) roleSorter includes 1000 agents with (cpus:4,mem:1024,disk:1024)
2) quotaRoleSorter includes 1000 agents with (cpus:4,mem:1024,disk:1024)
3) revocableSorter includes nothing, as there are no revocable resources here.
4) scarceSorter includes 1 agent with (gpus:1)

When allocating resources, even if a role gets the agent with GPU resources,
its share will only be counted by the scarceSorter and not by the other
sorters, so it will not impact them.

The above solution is actually a kind of enhancement to "excluding scarce
resources", since the scarce resources also obey the DRF algorithm under this
scheme.

This solution can also be viewed as logically dividing the whole resource pool
into a scarce and a non-scarce pool: 1), 2) and 3) handle non-scarce resources,
while 4) focuses on scarce resources.
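
To make the accounting concrete, here is a minimal, hypothetical Python sketch
(illustrative only, not Mesos code; the sorter names and cluster totals simply
mirror the example above) of how counting GPUs only in the scarceSorter keeps
a role's non-scarce share small:

SCARCE = {"gpus"}

# Cluster totals from the example: 1000 agents, one of which has the GPU.
total = {"cpus": 4000.0, "mem": 1024000.0, "disk": 1024000.0, "gpus": 1.0}

def dominant_share(allocated, resources):
    # DRF dominant share restricted to the given resource names.
    return max((allocated.get(r, 0.0) / total[r] for r in resources), default=0.0)

def shares(allocated):
    non_scarce = [r for r in total if r not in SCARCE]
    scarce = [r for r in total if r in SCARCE]
    return {"roleSorter/quotaRoleSorter": dominant_share(allocated, non_scarce),
            "scarceSorter": dominant_share(allocated, scarce)}

# A role holding the entire GPU agent (gpus:1, cpus:4, mem:1024, disk:1024):
print(shares({"gpus": 1, "cpus": 4, "mem": 1024, "disk": 1024}))
# -> non-scarce share is ~0.001 while the scarce share is 1.0, so the GPU no
#    longer dominates the role's position in the non-scarce sorters.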

Thanks,

Guangya

On Thu, Jun 16, 2016 at 2:10 AM, Benjamin Mahler  wrote:

> Hm.. can you expand on how adding another allocation stage for only scarce
> resources would behave well? It seems to have a number of problems when I
> think through it.
>
> On Sat, Jun 11, 2016 at 7:59 AM, Guangya Liu  wrote:
>
>> Hi Ben,
>>
>> For long term goal, instead of creating sub-pool, what about adding a new
>> sorter to handle **scarce** resources? The current logic in allocator was
>> divided to two stages: allocation for quota, allocation for non quota
>> resources.
>>
>> I think that the future logic in allocator would be divided to four
>> stages:
>> 1) allocation for quota
>> 2) allocation for reserved resources
>> 3) allocation for revocable resources
>> 4) allocation for scarce resources
>>
>> Thanks,
>>
>> Guangy
>>
>> On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler 
>> wrote:
>>
>>> I wanted to start a discussion about the allocation of "scarce"
>>> resources. "Scarce" in this context means resources that are not present on
>>> every machine. GPUs are the first example of a scarce resource that we
>>> support as a known resource type.
>>>
>>> Consider the behavior when there are the following agents in a cluster:
>>>
>>> 999 agents with (cpus:4,mem:1024,disk:1024)
>>> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
>>>
>>> Here there are 1000 machines but only 1 has GPUs. We call GPUs a
>>> "scarce" resource here because they are only present on a small percentage
>>> of the machines.
>>>
>>> We end up with some problematic behavior here with our current
>>> allocation model:
>>>
>>> (1) If a role wishes to use both GPU and non-GPU resources for
>>> tasks, consuming 1 GPU will lead DRF to consider the role to have a 100%
>>> share of the cluster, since it consumes 100% of the GPUs in the cluster.
>>> This framework will then not receive any other offers.
>>>
>>> (2) Because we do not have revocation yet, if a framework decides to
>>> consume the non-GPU resources on a GPU machine, it will prevent the GPU
>>> workloads from running!
>>>
>>> 
>>>
>>> I filed an epic [1] to track this. The plan for the short-term is to
>>> introduce two mechanisms to mitigate these issues:
>>>
>>> -Introduce a resource fairness exclusion list. This allows the
>>> shares of resources like "gpus" to be excluded from the dominant share.
>>>
>>> -Introduce a GPU_AWARE framework capability. This indicates that the
>>> scheduler is aware of GPUs and will schedule tasks accordingly. Old
>>> schedulers will not have the capability and will not receive any offers for
>>> GPU machines. If a scheduler has the capability, we'll advise that they
>>> avoid placing their additional non-GPU workloads on the GPU machines.
>>>
>>> 
>>>
>>> Longer term, we'll want a more robust way to manage scarce resources.
>>> The first thought we had was to have sub-pools of resources based on
>>> machine profile and perform fair sharing / quota within each pool. This
>>> addresses (1) cleanly, and for (2) the operator needs to explicitly
>>> disallow non-GPU frameworks from participating in the GPU pool.
>>>
>>> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
>>> have a lower level of utilization. In the even longer term, as we add
>>> revocation it will be possible to allow a scheduler desiring GPUs to revoke
>>> the resources allocated to the non-GPU workloads running on the GPU
>>> machines. There are a number of things we need to put in place to support
>>> revocation ([2], [3], [4], etc), so I'm 

Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-13 Thread Yong Feng
Thanks, Ben M., for taking the initiative.

You have already defined scarce resources and described the problem clearly.
Just to recap, the requirements we want to address are as follows:

1) Scarce resources should not be treated as the dominant resource during
allocation, so that a user can keep using non-scarce resources no matter how
many scarce resources he/she has consumed.
2) Hosts with scarce resources should be allocated first (or only) to
frameworks that need scarce resources, so that those frameworks are not
starved.

However, this also introduces new problems, for example:

How do we enforce "fairness" among frameworks when allocating scarce
resources?
How do we improve resource utilization when the cluster is partitioned by
scarce resources?

Going back to the GPU use case, I agree with the short-term solutions:

a) Exclude GPUs from the dominant-resource calculation.
b) Allow a framework to specify scarce resources (for example, GPUs) as a
capability when registering, so that Mesos only sends offers containing
scarce resources to those frameworks. The capability should not be
hard-coded to handle only GPUs.
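
As a rough illustration of (b), a hypothetical sketch follows (the capability
check is an assumption, not the actual Mesos API): an agent that has scarce
resources is offered only to frameworks that registered the matching
capability, while agents without scarce resources are offered to everyone.

SCARCE = {"gpus"}

def receives_offer(framework_capabilities, agent_resources):
    # A framework only sees agents whose scarce resources it declared support for.
    scarce_on_agent = SCARCE & set(agent_resources)
    return scarce_on_agent <= set(framework_capabilities)

print(receives_offer({"gpus"}, {"gpus": 1, "cpus": 4}))   # True: GPU-aware framework
print(receives_offer(set(), {"gpus": 1, "cpus": 4}))      # False: offer withheld
print(receives_offer(set(), {"cpus": 4, "mem": 1024}))    # True: non-GPU agent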

For the long term, since scarce resources and even host attributes already
partition the cluster into resource pools, we should have a solution that
resolves the newly introduced problems mentioned above:

1) Mesos will also need to enforce fairness when allocating scarce resources
(or hosts with special attributes).
2) Mesos should allow frameworks that do not ask for a scarce resource (or a
host with a special attribute) to use hosts with that scarce resource (or
attribute) when no framework is asking for it.
3) Mesos should allow frameworks that ask for scarce resources to preempt
those resources back.
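
To make point 2) concrete, here is a small hypothetical sketch (the Framework
type and function names are assumptions, not Mesos code): frameworks that
asked for the agent's scarce resources get priority, and other frameworks are
considered only when nobody is asking for them.

from dataclasses import dataclass, field

@dataclass
class Framework:
    name: str
    wants: set = field(default_factory=set)  # scarce resources it asked for

def offer_candidates(frameworks, agent_scarce):
    # Prefer frameworks asking for the agent's scarce resources; fall back to
    # everyone only when no such framework exists.
    wanting = [f for f in frameworks if f.wants & agent_scarce]
    return wanting if wanting else list(frameworks)

fws = [Framework("gpu-trainer", {"gpus"}), Framework("web-service")]
print([f.name for f in offer_candidates(fws, {"gpus"})])      # ['gpu-trainer']
print([f.name for f in offer_candidates(fws[1:], {"gpus"})])  # ['web-service']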

YARN's node-label feature in the upcoming 2.8 release (
https://wangda.live/2016/04/16/suggestions-about-how-to-better-use-yarn-node-label/)
tries to address similar use cases (GPUs and so on). Even though YARN's
allocation model is request-based, we can still learn from its experience.

Thanks,

Yong

On Fri, Jun 10, 2016 at 10:50 PM, Benjamin Mahler 
wrote:

> I wanted to start a discussion about the allocation of "scarce" resources.
> "Scarce" in this context means resources that are not present on every
> machine. GPUs are the first example of a scarce resource that we support as
> a known resource type.
>
> Consider the behavior when there are the following agents in a cluster:
>
> 999 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
>
> Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
> resource here because they are only present on a small percentage of the
> machines.
>
> We end up with some problematic behavior here with our current allocation
> model:
>
> (1) If a role wishes to use both GPU and non-GPU resources for tasks,
> consuming 1 GPU will lead DRF to consider the role to have a 100% share of
> the cluster, since it consumes 100% of the GPUs in the cluster. This
> framework will then not receive any other offers.
>
> (2) Because we do not have revocation yet, if a framework decides to
> consume the non-GPU resources on a GPU machine, it will prevent the GPU
> workloads from running!
>
> 
>
> I filed an epic [1] to track this. The plan for the short-term is to
> introduce two mechanisms to mitigate these issues:
>
> -Introduce a resource fairness exclusion list. This allows the shares
> of resources like "gpus" to be excluded from the dominant share.
>
> -Introduce a GPU_AWARE framework capability. This indicates that the
> scheduler is aware of GPUs and will schedule tasks accordingly. Old
> schedulers will not have the capability and will not receive any offers for
> GPU machines. If a scheduler has the capability, we'll advise that they
> avoid placing their additional non-GPU workloads on the GPU machines.
>
> 
>
> Longer term, we'll want a more robust way to manage scarce resources. The
> first thought we had was to have sub-pools of resources based on machine
> profile and perform fair sharing / quota within each pool. This addresses
> (1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
> frameworks from participating in the GPU pool.
>
> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
> have a lower level of utilization. In the even longer term, as we add
> revocation it will be possible to allow a scheduler desiring GPUs to revoke
> the resources allocated to the non-GPU workloads running on the GPU
> machines. There are a number of things we need to put in place to support
> revocation ([2], [3], [4], etc), so I'm glossing over the details here.
>
> If anyone has any thoughts or insight in this area, please share!
>
> Ben
>
> [1] https://issues.apache.org/jira/browse/MESOS-5377
> [2] https://issues.apache.org/jira/browse/MESOS-5524
> [3] https://issues.apache.org/jira/browse/MESOS-5527
> [4] https://issues.apache.org/jira/browse/MESOS-4392
>


Re: [GPU] [Allocation] "Scarce" Resource Allocation

2016-06-11 Thread Guangya Liu
Hi Ben,

For the long-term goal, instead of creating sub-pools, what about adding a new
sorter to handle **scarce** resources? The current logic in the allocator is
divided into two stages: allocation for quota, and allocation for non-quota
resources.

I think the future logic in the allocator would be divided into four stages:
1) allocation for quota
2) allocation for reserved resources
3) allocation for revocable resources
4) allocation for scarce resources
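
A minimal sketch of what such an ordered, multi-stage pass could look like
(the stage bodies are placeholders, purely illustrative, not Mesos code):

def allocate_quota(state):      pass  # 1) satisfy roles with quota first
def allocate_reserved(state):   pass  # 2) then reserved resources
def allocate_revocable(state):  pass  # 3) then revocable resources
def allocate_scarce(state):     pass  # 4) finally scarce resources (e.g. gpus)

def allocate(state):
    # Run the stages in order; each stage only hands out its own class of
    # resources to the roles tracked in its own sorter.
    for stage in (allocate_quota, allocate_reserved,
                  allocate_revocable, allocate_scarce):
        stage(state)

allocate({})  # no-op demonstration call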

Thanks,

Guangya

On Sat, Jun 11, 2016 at 10:50 AM, Benjamin Mahler 
wrote:

> I wanted to start a discussion about the allocation of "scarce" resources.
> "Scarce" in this context means resources that are not present on every
> machine. GPUs are the first example of a scarce resource that we support as
> a known resource type.
>
> Consider the behavior when there are the following agents in a cluster:
>
> 999 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
>
> Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
> resource here because they are only present on a small percentage of the
> machines.
>
> We end up with some problematic behavior here with our current allocation
> model:
>
> (1) If a role wishes to use both GPU and non-GPU resources for tasks,
> consuming 1 GPU will lead DRF to consider the role to have a 100% share of
> the cluster, since it consumes 100% of the GPUs in the cluster. This
> framework will then not receive any other offers.
>
> (2) Because we do not have revocation yet, if a framework decides to
> consume the non-GPU resources on a GPU machine, it will prevent the GPU
> workloads from running!
>
> 
>
> I filed an epic [1] to track this. The plan for the short-term is to
> introduce two mechanisms to mitigate these issues:
>
> -Introduce a resource fairness exclusion list. This allows the shares
> of resources like "gpus" to be excluded from the dominant share.
>
> -Introduce a GPU_AWARE framework capability. This indicates that the
> scheduler is aware of GPUs and will schedule tasks accordingly. Old
> schedulers will not have the capability and will not receive any offers for
> GPU machines. If a scheduler has the capability, we'll advise that they
> avoid placing their additional non-GPU workloads on the GPU machines.
>
> 
>
> Longer term, we'll want a more robust way to manage scarce resources. The
> first thought we had was to have sub-pools of resources based on machine
> profile and perform fair sharing / quota within each pool. This addresses
> (1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
> frameworks from participating in the GPU pool.
>
> Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
> have a lower level of utilization. In the even longer term, as we add
> revocation it will be possible to allow a scheduler desiring GPUs to revoke
> the resources allocated to the non-GPU workloads running on the GPU
> machines. There are a number of things we need to put in place to support
> revocation ([2], [3], [4], etc), so I'm glossing over the details here.
>
> If anyone has any thoughts or insight in this area, please share!
>
> Ben
>
> [1] https://issues.apache.org/jira/browse/MESOS-5377
> [2] https://issues.apache.org/jira/browse/MESOS-5524
> [3] https://issues.apache.org/jira/browse/MESOS-5527
> [4] https://issues.apache.org/jira/browse/MESOS-4392
>


[GPU] [Allocation] "Scarce" Resource Allocation

2016-06-10 Thread Benjamin Mahler
I wanted to start a discussion about the allocation of "scarce" resources.
"Scarce" in this context means resources that are not present on every
machine. GPUs are the first example of a scarce resource that we support as
a known resource type.

Consider the behavior when there are the following agents in a cluster:

999 agents with (cpus:4,mem:1024,disk:1024)
1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)

Here there are 1000 machines but only 1 has GPUs. We call GPUs a "scarce"
resource here because they are only present on a small percentage of the
machines.

We end up with some problematic behavior here with our current allocation
model:

(1) If a role wishes to use both GPU and non-GPU resources for tasks,
consuming 1 GPU will lead DRF to consider the role to have a 100% share of
the cluster, since it consumes 100% of the GPUs in the cluster. This
framework will then not receive any other offers.

(2) Because we do not have revocation yet, if a framework decides to
consume the non-GPU resources on a GPU machine, it will prevent the GPU
workloads from running!



I filed an epic [1] to track this. The plan for the short-term is to
introduce two mechanisms to mitigate these issues:

- Introduce a resource fairness exclusion list. This allows the shares
of resources like "gpus" to be excluded from the dominant share.

- Introduce a GPU_AWARE framework capability. This indicates that the
scheduler is aware of GPUs and will schedule tasks accordingly. Old
schedulers will not have the capability and will not receive any offers for
GPU machines. If a scheduler has the capability, we'll advise that they
avoid placing their additional non-GPU workloads on the GPU machines.
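
To illustrate how the exclusion list could change the arithmetic, here is a
small hypothetical Python sketch (the variable names and code are assumptions,
not the actual Mesos flag or implementation): resource names on the exclusion
list are simply dropped from the dominant-share calculation, so the example
above goes from a 100% share to a 0.1% share.

fair_share_exclusions = {"gpus"}  # hypothetical operator-configured list

def dominant_share(allocated, total, exclusions=frozenset()):
    # DRF dominant share, skipping any excluded resource names.
    shares = [allocated[r] / total[r] for r in allocated
              if r not in exclusions and total.get(r)]
    return max(shares, default=0.0)

# Cluster totals from the example: 1000 agents, one with a single GPU.
total = {"cpus": 4000, "mem": 1024000, "disk": 1024000, "gpus": 1}
alloc = {"cpus": 4, "mem": 1024, "disk": 1024, "gpus": 1}  # the whole GPU agent

print(dominant_share(alloc, total))                         # 1.0   (gpus dominate)
print(dominant_share(alloc, total, fair_share_exclusions))  # 0.001 (gpus excluded)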



Longer term, we'll want a more robust way to manage scarce resources. The
first thought we had was to have sub-pools of resources based on machine
profile and perform fair sharing / quota within each pool. This addresses
(1) cleanly, and for (2) the operator needs to explicitly disallow non-GPU
frameworks from participating in the GPU pool.

Unfortunately, by excluding non-GPU frameworks from the GPU pool we may
have a lower level of utilization. In the even longer term, as we add
revocation it will be possible to allow a scheduler desiring GPUs to revoke
the resources allocated to the non-GPU workloads running on the GPU
machines. There are a number of things we need to put in place to support
revocation ([2], [3], [4], etc), so I'm glossing over the details here.

If anyone has any thoughts or insight in this area, please share!

Ben

[1] https://issues.apache.org/jira/browse/MESOS-5377
[2] https://issues.apache.org/jira/browse/MESOS-5524
[3] https://issues.apache.org/jira/browse/MESOS-5527
[4] https://issues.apache.org/jira/browse/MESOS-4392