[ https://issues.apache.org/jira/browse/YARN-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385978#comment-14385978 ]
Carlo Curino commented on YARN-2670:
------------------------------------

I really welcome this line of thinking. We did some work in this space and demonstrated that, under certain experimental conditions, a feedback loop on instantaneous cluster conditions, when coupled with proper extensions of the scheduler, can lead to substantial performance improvements (the Limplock paper, http://dl.acm.org/citation.cfm?id=2523627, discusses related ideas). This is particularly relevant because YARN does not track all resources (e.g., there is no disk or network bookkeeping/policing). It is also needed to account for load produced by other services that run on the box but are not managed by YARN, e.g., HDFS / HBase. I look forward to hearing more about Astro and how you are attacking this. Do you have any document or initial patch for this?

> Adding feedback capability to capacity scheduler from external systems
> ----------------------------------------------------------------------
>
>                 Key: YARN-2670
>                 URL: https://issues.apache.org/jira/browse/YARN-2670
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>
> The sheer growth in data volume and Hadoop cluster size makes it a
> significant challenge to diagnose and locate problems in a production-level
> cluster environment efficiently and within a short period of time. Often,
> distributed monitoring systems are not capable of detecting a problem well
> in advance when a large-scale Hadoop cluster starts to deteriorate in
> performance or becomes unavailable. Thus, incoming workloads scheduled
> between the time when the cluster starts to deteriorate and the time when
> the problem is identified suffer from longer execution times. As a result,
> both reliability and throughput of the cluster drop significantly. We
> address this problem by proposing a system called Astro, which consists of a
> predictive model and an extension to the Capacity scheduler. The predictive
> model in Astro takes into account a rich set of cluster behavioral
> information that is collected by monitoring processes and models it using
> machine learning algorithms to predict future behavior of the cluster. The
> Astro predictive model detects anomalies in the cluster and also identifies
> a ranked set of metrics that have contributed the most towards the problem.
> The Astro scheduler uses the prediction outcome and the list of metrics to
> decide whether it needs to move and reduce workloads from the problematic
> cluster nodes or to prevent additional workload allocations to them, in
> order to improve both throughput and reliability of the cluster.
> This JIRA is only for adding feedback capabilities to Capacity Scheduler
> which can take feedback from external systems.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
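
For discussion purposes, the decision described in the issue text (use a per-node prediction plus a ranked metric list to either stop new allocations or reduce existing workload) could be sketched roughly as below. This is only an illustration under assumptions: `FeedbackSketch`, `NodeFeedback`, `Action`, the 0.0 to 1.0 anomaly score, and both thresholds are hypothetical names and values, not part of YARN, the Capacity Scheduler, or any Astro patch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; none of these types exist in YARN. */
public class FeedbackSketch {

    /** Possible reactions to feedback about a node, following the issue text. */
    enum Action { NONE, BLOCK_NEW_ALLOCATIONS, REDUCE_WORKLOAD }

    /** One prediction from an external system about one node (assumed shape). */
    static final class NodeFeedback {
        final String nodeId;
        final double anomalyScore;        // assumed: 0.0 = healthy, 1.0 = failing
        final List<String> rankedMetrics; // top contributing metrics, worst first

        NodeFeedback(String nodeId, double anomalyScore, List<String> rankedMetrics) {
            this.nodeId = nodeId;
            this.anomalyScore = anomalyScore;
            this.rankedMetrics = rankedMetrics;
        }
    }

    // Assumed thresholds; a real policy would make these configurable.
    static final double BLOCK_THRESHOLD = 0.5;
    static final double REDUCE_THRESHOLD = 0.8;

    /** Maps a prediction to an action: the higher the score, the stronger the reaction. */
    static Action decide(NodeFeedback fb) {
        if (fb.anomalyScore >= REDUCE_THRESHOLD) {
            return Action.REDUCE_WORKLOAD;
        }
        if (fb.anomalyScore >= BLOCK_THRESHOLD) {
            return Action.BLOCK_NEW_ALLOCATIONS;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        NodeFeedback fb = new NodeFeedback(
                "node-17", 0.9, Arrays.asList("disk_io_wait", "net_retransmits"));
        // Prints: node-17 -> REDUCE_WORKLOAD (top metric: disk_io_wait)
        System.out.println(fb.nodeId + " -> " + decide(fb)
                + " (top metric: " + fb.rankedMetrics.get(0) + ")");
    }
}
```

Keeping the decision in a pure function of the feedback record would let an external system be swapped or tested without touching scheduler internals; how such feedback would actually reach the Capacity Scheduler is exactly what this JIRA leaves open.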