[ https://issues.apache.org/jira/browse/WHIRR-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994295#comment-12994295 ]

David Alves commented on WHIRR-238:
-----------------------------------

Some additional thoughts:

- Use Sigar for generic metrics gathering (it recently changed to the Apache
  License 2.0); a sketch of the metrics a monitor could collect follows this
  list.
- The whirr cli starts a monitor process on all of the cluster nodes.
- The monitor process is passed the addresses of the zk ensemble.
- The monitor processes use zk's ephemeral nodes to elect a leader that will
  be the coordinator (see the election sketch below).
- Each monitor will gather generic metrics from Sigar and will optionally
  accept application-specific metrics gatherers (e.g. reqs/sec in HBase, or
  current jobs in Hadoop).
- Each monitor will feed these metrics, tagged with its own unique id, to the
  elected coordinator (through Avro; a possible record layout is sketched
  below).
- The coordinator will gather the metrics and feed them to a scaling rule
  engine.
- The rule engine may be generic, i.e. based only on load, but should be able
  to receive (be replaced by? be composed with?) application-specific rule
  engines (see the interface sketch below).
- There should be configurable default hard limits for scaling (min, max) so
  that coordinator bugs cannot wrongly scale clusters to huge sizes.
- All non-volatile coordinator state is to be kept in zk.
- When a coordinator node leaves the cluster another one is elected and starts
  where the previous one left off (it is probably not a problem to lose some
  metrics).
- Maybe in the future there could be some simple DDL for rules, but for now
  keep things simple: rule engines receive metrics and programmatically
  start/stop nodes using whirr's cli.
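
A rough sketch of the generic metrics gathering with Sigar (the Hyperic
class/method names are from memory and the metric keys are made up, so treat
this as a sketch rather than a drop-in implementation):

    import java.util.HashMap;
    import java.util.Map;

    import org.hyperic.sigar.CpuPerc;
    import org.hyperic.sigar.Mem;
    import org.hyperic.sigar.Sigar;
    import org.hyperic.sigar.SigarException;

    /** Sketch: collects a few generic metrics on every monitor tick. */
    public class SigarMetricsGatherer {

      private final Sigar sigar = new Sigar();

      public Map<String, Double> gather() throws SigarException {
        Map<String, Double> metrics = new HashMap<String, Double>();
        CpuPerc cpu = sigar.getCpuPerc();
        Mem mem = sigar.getMem();
        metrics.put("cpu.combined", cpu.getCombined());        // fraction of cpu in use
        metrics.put("mem.used.percent", mem.getUsedPercent()); // physical memory in use
        metrics.put("load.avg.1min", sigar.getLoadAverage()[0]);
        // Disk and net I/O would be added the same way (getFileSystemUsage,
        // getNetInterfaceStat), and application-specific gatherers would merge
        // their own entries into this map.
        return metrics;
      }
    }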
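
For the coordinator election on top of zk's ephemeral nodes, something along
these lines (the path /whirr/monitors/election is made up, its parent znode is
assumed to already exist, and watch handling/retries are left out):

    import java.util.Collections;
    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    /** Sketch: the monitor holding the lowest ephemeral sequential znode is the coordinator. */
    public class CoordinatorElection {

      private static final String ELECTION_PATH = "/whirr/monitors/election"; // hypothetical

      private final ZooKeeper zk;
      private String myZnode;

      public CoordinatorElection(ZooKeeper zk) {
        this.zk = zk;
      }

      public boolean runForCoordinator(byte[] monitorId)
          throws KeeperException, InterruptedException {
        if (myZnode == null) {
          // Ephemeral: the znode (and the candidacy) disappears when the monitor dies.
          myZnode = zk.create(ELECTION_PATH + "/n_", monitorId,
              Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        }
        List<String> candidates = zk.getChildren(ELECTION_PATH, false);
        Collections.sort(candidates);
        // Everyone else would watch its predecessor and re-run this check when it goes away.
        return myZnode.endsWith(candidates.get(0));
      }
    }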
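
The per-monitor metrics message could be a small Avro record, roughly like
this (the schema fields are only a first guess):

    import java.util.Map;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    /** Sketch: the record a monitor would serialize and send to the coordinator. */
    public class MetricsReport {

      // Hypothetical schema: monitor id, timestamp, and a map of metric name -> value.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\": \"record\", \"name\": \"MetricsReport\", \"fields\": ["
              + "{\"name\": \"monitorId\", \"type\": \"string\"},"
              + "{\"name\": \"timestamp\", \"type\": \"long\"},"
              + "{\"name\": \"metrics\", \"type\": {\"type\": \"map\", \"values\": \"double\"}}"
              + "]}");

      public static GenericRecord build(String monitorId, Map<String, Double> metrics) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("monitorId", monitorId);
        record.put("timestamp", System.currentTimeMillis());
        record.put("metrics", metrics);
        return record;
      }
    }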
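
And the pluggable rule engine plus the hard limits could boil down to
something like this (all of these names are made up for the sketch; the
coordinator would clamp whatever the engine asks for and then drive whirr's
cli to add or remove nodes):

    import java.util.Map;

    /** Sketch: a per-application (or generic, load-only) scaling rule engine. */
    public interface ScalingRuleEngine {

      /** Returns the desired cluster size given the latest metrics keyed by monitor id. */
      int desiredSize(Map<String, Map<String, Double>> metricsByMonitor, int currentSize);
    }

    /** Sketch: the configurable (min, max) hard limits applied by the coordinator. */
    class ScalingLimits {

      private final int minNodes;
      private final int maxNodes;

      ScalingLimits(int minNodes, int maxNodes) {
        this.minNodes = minNodes;
        this.maxNodes = maxNodes;
      }

      int clamp(int desiredSize) {
        // A buggy rule engine can never push the cluster outside [minNodes, maxNodes].
        return Math.max(minNodes, Math.min(maxNodes, desiredSize));
      }
    }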




> Scaling Monitor/Coordinator
> ---------------------------
>
>                 Key: WHIRR-238
>                 URL: https://issues.apache.org/jira/browse/WHIRR-238
>             Project: Whirr
>          Issue Type: New Feature
>          Components: core
>            Reporter: David Alves
>
> From the mailing list:
> General idea:
> Add an elastic scaling monitor and coordinator, i.e. a whirr process that 
> would be running on some or all of the nodes that:
>       - would collect load metrics (both generic and specific to each 
> application)
>       - would feed them through an elastic decision making engine (also 
> specific to each application as it depends on the specific metrics)
>       - would then act on those decisions by either expanding or contracting 
> the cluster.
>       Some specifics:
>       - it need not be completely distributed, i.e. it can have a specific 
> assigned node that will monitor/coordinate, but this node must not be fixed, 
> i.e. it could/should change if the previous coordinator leaves the cluster.
>       - each application would define the set of metrics that it emits and 
> use a local monitor process to feed them to the coordinator.
>       - the monitor process should emit some standard metrics (Disk I/O, CPU 
> Load, Net I/O, memory)
>       - the coordinator would have a pluggable decision engine policy also 
> defined by the application that would consume metrics and make a decision.
>       - whirr would take care of requesting/releasing nodes and 
> adding/removing them from the relevant services.
>       Some implementation ideas:
>       - it could run on top of zookeeper. zk is already a requirement for 
> several services and would allow coordinator state to be stored reliably so 
> that another node can pick up if the previous coordinator leaves the cluster.
>       - it could use Avro to serialize/deserialize metrics data 
>       - it should be optional, i.e. simply another service that the whirr cli 
> starts
>       - it would also be nice to have a monitor/coordinator web page that 
> would display metrics and view cluster status in an aggregated view.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira