[
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493538#comment-16493538
]
Weiwei Yang commented on YARN-8320:
-----------------------------------
Thanks [[email protected]] for sharing your idea. You were right
that the original idea was to make this easy to use. That is, users don't
need to know which set of cpus their containers will run on or how those
cpus are configured. They just give us a cpu_share_mode, and we do all the
work underneath without exposing too many details.
My concerns about the approach you suggested are:
# It might be complex for users to use
# It should be able to support 2 modes, but it is not straightforward to extend it to 4 modes
Allow me to take an example like the following:
{noformat}
I have a NM with capacity:
memory: 10gb
vcore: 10
cpus: 10 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Request with just cpu number:
memory: 1gb
vcore: 5
cpuset: 5
After allocation, my NM capacity updates to
memory: 9gb
vcore: 5
cpus: 5 (0, 1, 2, 3, 4)
{noformat}
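To make the accounting in this example concrete, here is a minimal Python sketch. The names NodeCapacity, Request and allocate are hypothetical and are not YARN APIs. It also assumes cpus are handed out from the highest-numbered id downward, which is just one policy that leaves cpus 0-4 free as in the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NodeCapacity:
    """Hypothetical view of what an NM still has available."""
    memory_gb: int
    vcores: int
    cpus: List[int] = field(default_factory=list)


@dataclass
class Request:
    """Hypothetical request that carries only a cpu count, not cpu ids."""
    memory_gb: int
    vcores: int
    cpuset: int


def allocate(node: NodeCapacity, req: Request) -> Tuple[NodeCapacity, List[int]]:
    """Return the remaining capacity and the cpu ids granted to the request."""
    if (req.memory_gb > node.memory_gb
            or req.vcores > node.vcores
            or req.cpuset > len(node.cpus)):
        raise ValueError("request does not fit on this node")
    keep = len(node.cpus) - req.cpuset
    granted = node.cpus[keep:]  # assumption: take the highest-numbered cpus
    remaining = NodeCapacity(
        memory_gb=node.memory_gb - req.memory_gb,
        vcores=node.vcores - req.vcores,
        cpus=node.cpus[:keep],
    )
    return remaining, granted


nm = NodeCapacity(memory_gb=10, vcores=10, cpus=list(range(10)))
remaining, granted = allocate(nm, Request(memory_gb=1, vcores=5, cpuset=5))
print(remaining, granted)
```

Note that vcores and cpus are decremented independently here, so nothing keeps the two values consistent with each other.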
There are a few problems with such an approach:
# Users might get confused about how many cpus to ask for in the resource request.
Vcore is already difficult to set today, and adding a new resource type
might make this harder.
# When #vcore is not the same as #processor on the NM, users will need to do some
calculation to set a reasonable cpuset value so that they neither over-use nor under-use cpu
resources, and this is hard for the RM to check because it doesn't have all the info
that the NM has
# It is difficult to support all 4 modes under the current resource APIs
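As a rough illustration of the second problem, the sketch below shows the kind of calculation a user would have to do. The proportional rule and the function name cpuset_for are assumptions made for illustration only; they are not part of any proposal or YARN API.

```python
def cpuset_for(req_vcores: int, nm_vcores: int, nm_processors: int,
               round_up: bool = True) -> int:
    """Hypothetical rule: scale the vcore request by the NM's processor/vcore ratio."""
    scaled = req_vcores * nm_processors
    if round_up:
        return -(-scaled // nm_vcores)  # integer ceiling
    return scaled // nm_vcores          # integer floor


# Assumed NM: 20 vcores configured on 10 physical processors.
# A request for 5 vcores maps to 2.5 processors, which cannot be bound exactly.
over = cpuset_for(5, nm_vcores=20, nm_processors=10, round_up=True)
under = cpuset_for(5, nm_vcores=20, nm_processors=10, round_up=False)
print(over, under)
```

Either rounding choice wastes or oversubscribes part of a processor, and the user needs to know both nm_vcores and nm_processors to do the calculation in the first place.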
Please let me know if there is anything wrong in this example or the comments.
I agree we can start by supporting EXCLUSIVE+ANY mode in phase 1, but I still
want to make sure the design can be extended to support the other modes as well
(because RESERVED/SHARE modes are very useful for improving utilization). I will
consolidate all the comments from you and [~leftnoteasy] and come up with a new
version of the design doc next week. I always look forward to your comments.
Thanks
> [Umbrella] Support CPU isolation for latency-sensitive (LS) service
> -------------------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf,
> CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and
> “cpu.shares” to isolate cpu resources. However,
> * The Linux Completely Fair Scheduler (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency
> * Request latency of services running in containers may fluctuate frequently
> when all containers share cpus, and latency-sensitive services cannot afford
> this in our production environment.
> So we need more fine-grained cpu isolation.
> Here we propose a solution that uses cgroup cpuset to bind containers to
> different processors. This is inspired by the isolation technique in [Borg
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)