[jira] [Commented] (YARN-8320) [Umbrella] Support CPU isolation for latency-sensitive (LS) service

Miklos Szegedi (JIRA) Wed, 06 Jun 2018 13:11:36 -0700


    [ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503850#comment-16503850 ]


Miklos Szegedi commented on YARN-8320:
--------------------------------------

[~cheersyang], it makes sense in general. I think what you are missing is that 
in non-strict mode with simple weights ({{cpu.shares}}), the weights are 
applied per thread. Indeed, there is a design issue in the existing 
{{CGroupsCPUResourcesHandler}}: it applies {{cpu.shares=vcores}} to each 
thread of a guaranteed container.

Given the request from your example above, the current code will actually give 
you 10 * N * N CPU time, which is wrong.
{code:java}
Request1:
  #vcore: 10 * N
  #cpu: N (0<N<=10){code}
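The arithmetic behind the 10 * N * N figure can be sketched as below. This is only an illustration under two assumptions: that {{cpu.shares=vcores}} is applied to every thread, as described above, and that the container runs one busy thread per requested CPU. The class and method names are hypothetical and do not exist in YARN.

```java
// Hypothetical illustration only; this is not code from YARN.
final class PerThreadWeight {

    // Assumption: every thread receives the full cpu.shares = vcores weight,
    // so the container's total weight grows with its thread count.
    static long effectiveWeight(long vcores, long busyThreads) {
        return vcores * busyThreads;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            long vcores = 10L * n; // "#vcore: 10 * N" from the request
            long weight = effectiveWeight(vcores, n); // one busy thread per CPU
            System.out.println("N=" + n + " requested=" + vcores + " effective=" + weight);
        }
    }
}
```

Under these assumptions the effective weight grows quadratically in N, while the request itself grows only linearly.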
How about another solution? Let's forget about vcores for a moment. In 
exclusive mode, all you need to set in cgroups is the following (this would be 
the advanced setup):
{code:java}
Request1:
  #cpu.shares: 1024
  #cpuset: 9
Request2:
  #cpu.shares: 1024
  #cpuset: 1{code}
If you prefer the simple setup, the client just needs to express the same 
request in vcores:
{code:java}
Request1:
  #vcores: 9*1024
Request2:
  #vcores: 1*1024{code}
Inside {{CGroupsCPUResourcesHandler}} and {{CGroupsCPUSetResourcesHandler}} 
these would translate to the following cgroup settings:
{code:java}
Request1:
  #cpu.shares: (#vcores + 1023) % 1024 + 1 = 1024
  #cpuset: (#vcores + 1023) / 1024 = 9
Request2:
  #cpu.shares: (#vcores + 1023) % 1024 + 1 = 1024
  #cpuset: (#vcores + 1023) / 1024 = 1{code}
Each thread gets the weight, and in exclusive mode all CPUs will be 
saturated.
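A minimal sketch of this translation, assuming 1024 vcores stand for one full CPU and using the {{(#vcores + 1023) % 1024 + 1}} form for the weight so that both requests evaluate to 1024. {{VcoreTranslation}} and its methods are hypothetical names, not the actual handler code, and the sketch assumes {{vcores >= 1}}.

```java
// Hypothetical helper; not the actual CGroupsCPUResourcesHandler code.
final class VcoreTranslation {

    // Assumption from the proposal: 1024 vcores represent one full CPU.
    static final int UNIT = 1024;

    // Weight in the range 1..1024 (for vcores >= 1); full CPUs map to 1024.
    static int cpuShares(int vcores) {
        return (vcores + UNIT - 1) % UNIT + 1;
    }

    // Number of CPUs to put in the cpuset: vcores / 1024, rounded up.
    static int cpusetSize(int vcores) {
        return (vcores + UNIT - 1) / UNIT;
    }

    public static void main(String[] args) {
        int[] requests = {9 * 1024, 1 * 1024, 512};
        for (int vcores : requests) {
            System.out.println("vcores=" + vcores
                    + " cpu.shares=" + cpuShares(vcores)
                    + " cpuset=" + cpusetSize(vcores));
        }
    }
}
```

With these formulas, {{vcores: 9*1024}} maps to a weight of 1024 on 9 CPUs, {{vcores: 1*1024}} to a weight of 1024 on 1 CPU, and {{vcores: 512}} to a weight of 512 on 1 CPU.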

This solution:
 # Provides the simple setup that you prefer. You need to specify vcores 
only.
 # Provides a custom resource type setup for advanced use, specifying 
cpu.shares and cpuset, in the future.
 # No special flag needs to be added to the container launch context.
 # It is backward compatible.
 # It allows RESERVED/SHARED/ANY mode by setting {{vcores: 512}}, for example, 
for shared/opportunistic containers. This would set 
{{cpu.shares: 512, cpuset: 1}}, sharing a CPU with another request of 
{{vcores: 512}}, and it even allows precise weighting.
 # It works with oversubscription by defaulting to {{cpu.shares=1}}.
 # Drawback: this does not provide a way to request, for example, multiple 
shared CPUs. That could be a special advanced feature in the future, specified 
through resource types.
 # Work: {{CGroupsCPUResourcesHandler}} needs to be fixed to set the weight to 
a constant.
 # Work: The advanced setup needs to make cpu.shares a second-level resource 
type, introducing something like a resource tree for the future.
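The "precise weighting" of shared containers can be illustrated with a small sketch. It assumes that all containers pinned to the same CPU are CPU-bound and that CFS divides time between them in proportion to their {{cpu.shares}}; the class and method names are hypothetical.

```java
// Hypothetical illustration of proportional sharing on one contended CPU.
final class SharedCpuSplit {

    // Fraction of the CPU each container would get, assuming every container
    // is CPU-bound and time is divided in proportion to cpu.shares.
    static double[] fractions(int... shares) {
        long total = 0;
        for (int s : shares) {
            total += s;
        }
        double[] result = new double[shares.length];
        for (int i = 0; i < shares.length; i++) {
            result[i] = (double) shares[i] / total;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] even = fractions(512, 512);   // two requests of vcores: 512
        double[] skewed = fractions(768, 256); // an uneven split of one CPU
        System.out.println("512/512 -> " + even[0] + ", " + even[1]);
        System.out.println("768/256 -> " + skewed[0] + ", " + skewed[1]);
    }
}
```

Two requests of {{vcores: 512}} would each get half of the shared CPU under these assumptions, while 768 and 256 would split it three to one.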

 

> [Umbrella] Support CPU isolation for latency-sensitive (LS) service
> -------------------------------------------------------------------
>
>                 Key: YARN-8320
>                 URL: https://issues.apache.org/jira/browse/YARN-8320
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>            Reporter: Jiandan Yang 
>            Priority: Major
>         Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate cpu resource. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
>  * Request latency of services running in containers may fluctuate 
> frequently when all containers share CPUs, which latency-sensitive services 
> cannot afford in our production environment.
> So we need more fine-grained CPU isolation.
> Here we propose a solution that uses cgroup cpuset to bind containers to 
> different processors; this is inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].



