[
https://issues.apache.org/jira/browse/YARN-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nathan Roberts updated YARN-5202:
---------------------------------
Attachment: (was: YARN-5202.patch)
> Dynamic Overcommit of Node Resources - POC
> ------------------------------------------
>
> Key: YARN-5202
> URL: https://issues.apache.org/jira/browse/YARN-5202
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager, resourcemanager
> Affects Versions: 3.0.0-alpha1
> Reporter: Nathan Roberts
> Assignee: Nathan Roberts
>
> This Jira presents a proof-of-concept implementation (a collaboration between
> [~jlowe] and myself) of dynamic over-commit in YARN.
> The type of over-commit implemented in this jira is similar to but not as
> full-featured as what's being implemented via YARN-1011. YARN-1011 is where
> we see ourselves heading but we needed something quick and completely
> transparent so that we could test it at scale with our varying workloads
> (mainly MapReduce, Spark, and Tez). Doing so has shed some light on how much
> additional capacity we can achieve with over-commit approaches, and has
> fleshed out some of the problems these approaches will face.
> Primary design goals:
> - Avoid changing protocols, application frameworks, or core scheduler logic:
> simply adjust individual nodes' available resources based on current node
> utilization and then let the scheduler do what it normally does
> - Over-commit slowly, pull back aggressively - If things are looking good and
> there is demand, slowly add resources. If memory starts to look over-utilized,
> aggressively reduce the amount of over-commit.
> - Make sure the nodes protect themselves - i.e. if memory utilization on a
> node gets too high, preempt something - preferably something from a
> preemptable queue
> A patch against trunk will be attached shortly. Some notes on the patch:
> - This feature was originally developed against something akin to 2.7. Since
> the patch is mainly to explain the approach, we tested against trunk only
> with a basic build and basic unit tests.
> - The key pieces of functionality are in {{SchedulerNode}},
> {{AbstractYarnScheduler}}, and {{NodeResourceMonitorImpl}}. The remainder of
> the patch is mainly UI, Config, Metrics, Tests, and some minor code
> duplication (e.g. to optimize node resource changes we treat an over-commit
> resource change differently than an updateNodeResource change - i.e.
> remove_node/add_node is just too expensive for the frequency of over-commit
> changes)
> - We only over-commit memory at this point.
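
The "over-commit slowly, pull back aggressively" and "nodes protect themselves" goals above could be sketched roughly as follows. This is a minimal illustration only, not code from the patch: the class name, the method names, the three utilization thresholds, the fixed step size, and the choice to halve the over-commit on pull-back are all assumptions made for the example.

```java
/**
 * Hypothetical sketch of a per-node memory over-commit policy.
 * Nothing here is taken from the YARN-5202 patch; all names and
 * numbers are illustrative assumptions.
 */
final class OvercommitPolicy {
    private final long baseMemMb;         // memory the node is configured with
    private final long maxOvercommitMb;   // upper bound on extra memory offered
    private final long stepUpMb;          // small increment used when growing
    private final double lowWatermark;    // below this utilization, growing is allowed
    private final double highWatermark;   // at or above this, pull back
    private final double preemptWatermark; // at or above this, the node should preempt
    private long overcommitMb = 0;

    OvercommitPolicy(long baseMemMb, long maxOvercommitMb, long stepUpMb,
                     double lowWatermark, double highWatermark,
                     double preemptWatermark) {
        this.baseMemMb = baseMemMb;
        this.maxOvercommitMb = maxOvercommitMb;
        this.stepUpMb = stepUpMb;
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
        this.preemptWatermark = preemptWatermark;
    }

    /**
     * Recomputes the memory the scheduler may treat as available on this node.
     * Grows by one small step when utilization is low and there is demand;
     * halves the over-commit (assumed pull-back rule) when utilization is high.
     */
    long update(double memUtilization, boolean hasPendingDemand) {
        if (memUtilization >= highWatermark) {
            overcommitMb /= 2; // aggressive pull-back
        } else if (memUtilization < lowWatermark && hasPendingDemand) {
            overcommitMb = Math.min(maxOvercommitMb, overcommitMb + stepUpMb); // slow growth
        }
        return baseMemMb + overcommitMb;
    }

    /** Node self-protection: true when something should be preempted. */
    boolean shouldPreempt(double memUtilization) {
        return memUtilization >= preemptWatermark;
    }
}
```

In this sketch the scheduler itself is untouched: a caller would feed the value returned by `update` into whatever mechanism changes the node's schedulable resources, and would consult `shouldPreempt` on the node side. Which container to preempt (for example, one from a preemptable queue) is left out.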