Hi Otis,

Perhaps I am getting this totally wrong, but here's how I look at it.

Let's say your problem as a whole needs X spindles + Y CPU cores + Z amount of 
RAM to make everything work out. Then, would it matter whether you divide that 
amount of resources (X, Y, Z) over heterogeneous or homogeneous machines? If 
that doesn't make a noticeable difference, then both options should be fine. 
And then, how would it be cheaper to get, for example, 40 spindles + 20 cores + 
120GB RAM divided over 2 big + 2 small boxes rather than over 5 equal boxes?
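
To make that concrete, here's a quick sketch of the arithmetic. The per-box 
specs below are made up purely for illustration (they are not from this thread); 
the point is just that both layouts can add up to the same cluster-wide totals:

```python
# Hypothetical per-box specs (illustrative numbers only):
big   = {"spindles": 12, "cores": 8, "ram_gb": 48}   # would run RS + DN + TT
small = {"spindles":  8, "cores": 2, "ram_gb": 12}   # would run DN + TT only
equal = {"spindles":  8, "cores": 4, "ram_gb": 24}   # homogeneous box

def total(layout):
    # Sum each resource over a list of (count, spec) pairs.
    return {k: sum(n * spec[k] for n, spec in layout)
            for k in ("spindles", "cores", "ram_gb")}

hetero = total([(2, big), (2, small)])
homo   = total([(5, equal)])
print(hetero)  # {'spindles': 40, 'cores': 20, 'ram_gb': 120}
print(homo)    # identical totals
```

Same totals either way, so the comparison comes down to price per box, not 
capacity.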

Here, I am reasoning with the assumption that when you need to scale out, you 
need to scale out both HBase and MapReduce capacity. If your HBase data size 
will stay fixed and the MapReduce capacity needs to grow independently, what 
you're saying makes more sense to me.


Cheers,
Friso



On 13 dec. 2011, at 20:44, Otis Gospodnetic wrote:

> Hi,
> 
> I was wondering if I could get some feedback on the craziness (or not) of 
> setting up a hybrid HBase-Hadoop cluster that has the following primary uses:
> 
> 1) continuous writes to HBase
> 2) disk and CPU intensive reads from HBase by MR jobs and writes of 
> aggregated data back to HBase by those jobs
> 3) occasional reads by people/reporting apps that read aggregates from HBase
> 
> I'm calling this hybrid HBase-Hadoop cluster because not all nodes in the 
> cluster would be running both a RegionServer and DataNode + TaskTracker.
> Instead, this is what it could look like:
> 
> * a set of *larger* nodes running RegionServers, DataNodes, TaskTrackers 
> (e.g., large EC2 instances)
> * a set of *smaller* nodes running only DNs and TTs, but *not* RSs (e.g. 
> small EC2 instances)
> 
> 
> The thinking here is that because 2) above needs to process a lot of data 
> (lots of reads, a good amount of writes, and relatively CPU-intensive work), 
> it's nice to have more nodes and spindles.
> But if we put RSs on all nodes to keep them close to the DNs, then all nodes 
> need to be relatively beefy in terms of RAM to keep HBase happy, and that 
> translates to more $$$.
> So the thinking/hope is that one could save $ by having more smaller/cheaper 
> nodes to do the disk IO and CPU intensive work, while having just enough RS 
> instances on the big nodes to handle the HBase side of 1) 2) and 3) above.
> 
> 
> Is the above setup crazy?
> 
> Are there some obvious flaws that would really cause operational or 
> performance pains?
> Would such a cluster have major performance issues because of data that needs 
> to be transferred between DNs that are on all nodes and RSs running only on 
> the big nodes?
> 
> 
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/