Carlo Curino commented on YARN-2670:

I really welcome this line of thinking. 

We did some work in this space and demonstrated that, under certain 
experimental conditions, a feedback loop on instantaneous cluster conditions, 
when coupled with proper extensions of the scheduler, can lead to substantial 
performance improvements (the Limplock paper 
http://dl.acm.org/citation.cfm?id=2523627 discusses related ideas).

This is particularly relevant, as YARN does not track all resources (e.g., no 
disk or network bookkeeping/policing). It is also needed to account for load 
produced by other services running on the box but not managed by YARN, e.g., 
HDFS / HBase.

I look forward to hearing more about Astro and how you are attacking this. Do 
you have any document or initial patch for this?

> Adding feedback capability to capacity scheduler from external systems
> ----------------------------------------------------------------------
>                 Key: YARN-2670
>                 URL: https://issues.apache.org/jira/browse/YARN-2670
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
> The sheer growth in data volume and Hadoop cluster size makes it a significant 
> challenge to diagnose and locate problems in a production-level cluster 
> environment efficiently and within a short period of time. Often, 
> distributed monitoring systems are not capable of detecting a problem well in 
> advance when a large-scale Hadoop cluster starts to deteriorate in 
> performance or becomes unavailable. Thus, incoming workloads scheduled 
> between the time when the cluster starts to deteriorate and the time when the 
> problem is identified suffer from longer execution times. As a result, both 
> the reliability and the throughput of the cluster drop significantly. We 
> address this problem by proposing a system called Astro, which consists of a 
> predictive model and an extension to the Capacity Scheduler. The predictive 
> model in Astro takes into account a rich set of cluster behavioral 
> information collected by monitoring processes and models it using 
> machine learning algorithms to predict the future behavior of the cluster. The 
> Astro predictive model detects anomalies in the cluster and also identifies a 
> ranked set of metrics that have contributed the most to the problem. The 
> Astro scheduler uses the prediction outcome and the list of metrics to decide 
> whether it needs to move and reduce workloads from the problematic cluster 
> nodes or to prevent additional workload allocations to them, in order to 
> improve both the throughput and the reliability of the cluster.
> This JIRA is only for adding feedback capabilities to the Capacity Scheduler, 
> so that it can take feedback from external systems.
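
To make the decision step in the description concrete, here is a minimal 
sketch of how external feedback might be mapped to a scheduling action. It is 
not the Astro design or any YARN API: the names (NodeFeedback, Action, 
decide_action, schedulable_nodes), the 0..1 anomaly score, and both thresholds 
are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ALLOW = "allow"          # schedule on the node normally
    BLOCK_NEW = "block_new"  # stop placing new allocations on the node
    DRAIN = "drain"          # additionally move or reduce existing workload


@dataclass
class NodeFeedback:
    """Hypothetical feedback record produced by an external predictive model."""
    node: str
    anomaly_score: float  # assumed scale: 0.0 (healthy) to 1.0 (failing)
    top_metrics: list = field(default_factory=list)  # ranked contributing metrics


def decide_action(fb, block_threshold=0.5, drain_threshold=0.8):
    """Map a prediction to an action; the thresholds are illustrative only."""
    if fb.anomaly_score >= drain_threshold:
        return Action.DRAIN
    if fb.anomaly_score >= block_threshold:
        return Action.BLOCK_NEW
    return Action.ALLOW


def schedulable_nodes(nodes, feedback):
    """Return nodes eligible for new allocations.

    Nodes with no feedback record are treated as healthy.
    """
    by_node = {fb.node: fb for fb in feedback}
    return [
        n for n in nodes
        if n not in by_node or decide_action(by_node[n]) is Action.ALLOW
    ]
```

In this sketch the scheduler only consults the filter before placing new 
work; how workloads would actually be moved off a draining node is left open, 
since the description does not specify it.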
