[jira] [Commented] (YARN-8320) [Umbrella] Support CPU isolation for latency-sensitive (LS) service

Weiwei Yang (JIRA) Fri, 08 Jun 2018 05:14:39 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505950#comment-16505950
 ]


Weiwei Yang commented on YARN-8320:
-----------------------------------

Hi [[email protected]]

The solution you proposed looks quite interesting, especially the idea 
about cpu.shares. Please see my comments below.

If we use the "simple" approach you shared:

1. A request such as
{noformat}
Request1:
  #vcores: 9*1024
{noformat}
breaks the basic semantics of vcore that we have been using for years; it is 
incompatible at the core API level.

2. For the formula you gave,
{noformat}
  #cpu.shares: (#vcores + 1023) % 1024 + 1 = 1024
{noformat}
What if I specify #vcores=8*512 (which equals 4*1024)? What would #cpu.shares 
and #cpuset be? I don't think you can derive two independent values from a 
single input.
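
To make the ambiguity concrete, here is a minimal sketch that only evaluates the formula quoted above. The function name and the reading of "8*512" as eight half-cores are assumptions made for illustration; nothing here is taken from the patch.

```python
def cpu_shares(vcores):
    """Evaluate the proposed formula: (#vcores + 1023) % 1024 + 1."""
    return (vcores + 1023) % 1024 + 1


# Two different intentions (assumed reading: 1024 units == one full core).
eight_half_cores = 8 * 512   # eight CPUs at half weight each
four_full_cores = 4 * 1024   # four CPUs at full weight each

# Both intentions collapse to the same integer before the formula ever runs,
# so the formula cannot return different cpu.shares (or cpuset sizes) for them.
print(eight_half_cores == four_full_cores)
print(cpu_shares(eight_half_cores), cpu_shares(four_full_cores))
```

Since both requests arrive as the single number 4096, any function of #vcores alone has to produce the same (#cpu.shares, #cpuset) pair for both.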

If we consider #cpu.shares and #cpuset as resources,
 # NM and RM need to know all the information about the physical processors, 
including their (virtual) shares. This introduces extra complexity for both NM 
and RM. Moreover, the current resource API doesn't support such 
multidimensional resources.
 # Precise weighting sounds like an interesting idea, but I doubt we really 
need that much precision. On our online systems, we don't control CPU at such 
a fine granularity.
 # It will be hard for users to set {{cpu.shares}}. How would a user know what 
value is meaningful? And what if a user sets a value that is too big or too 
small in their requests? They will get unpredictable results.
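
A rough sketch of why a single {{cpu.shares}} value is hard to reason about: under contention, CFS divides CPU time among sibling cgroups in proportion to their shares, so the same value means different things depending on what the neighbours ask for. The container names and numbers below are hypothetical, and the sketch assumes all containers are sibling cgroups that are fully CPU-bound.

```python
def cpu_fraction(shares):
    """Fraction of contended CPU time each sibling gets, proportional to cpu.shares."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# The same request (1024) yields a different slice once a neighbour asks for more.
balanced = cpu_fraction({"ls_service": 1024, "batch_job": 1024})
skewed = cpu_fraction({"ls_service": 1024, "batch_job": 3072})

print(balanced["ls_service"])
print(skewed["ls_service"])
```

A user who picks 1024 cannot tell from that number alone whether the container will get half of the contended CPU or a quarter of it.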

We can have some more offline chats about this. Thanks for bringing up the idea.

> [Umbrella] Support CPU isolation for latency-sensitive (LS) service
> -------------------------------------------------------------------
>
>                 Key: YARN-8320
>                 URL: https://issues.apache.org/jira/browse/YARN-8320
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>            Reporter: Jiandan Yang 
>            Priority: Major
>         Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate cpu resource. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
>  * Request latency of services running in containers may fluctuate 
> frequently when all containers share CPUs, which latency-sensitive services 
> cannot afford in our production environment.
> So we need more fine-grained CPU isolation.
> Here we propose a solution that uses cgroup cpuset to bind containers to 
> different processors. This is inspired by the isolation technique in [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
