Hi,

I was wondering if I could get some feedback on the craziness (or not) of 
setting up a hybrid HBase-Hadoop cluster that has the following primary uses:

1) continuous writes to HBase
2) disk- and CPU-intensive reads from HBase by MR jobs, and writes of aggregated 
data back to HBase by those jobs
3) occasional reads by people/reporting apps that read aggregates from HBase

I'm calling this a hybrid HBase-Hadoop cluster because not all nodes in the 
cluster would be running both a RegionServer and DataNode + TaskTracker.
Instead, this is what it could look like:

* a set of *larger* nodes running RegionServers, DataNodes, TaskTrackers (e.g., 
large EC2 instances)
* a set of *smaller* nodes running only DNs and TTs, but *not* RSs (e.g. small 
EC2 instances)


The thinking here is that because 2) above needs to process a lot of data 
(lots of reads, a good amount of writes, and relatively CPU-intensive work), it's 
nice to have more nodes and spindles.
But if we put RSs on all nodes to keep them close to the DNs, then all nodes need 
to be relatively beefy in terms of RAM to keep HBase happy, and that translates 
to more $$$.
So the thinking/hope is that one could save $ by having more smaller/cheaper 
nodes to do the disk-IO- and CPU-intensive work, while having just enough RS 
instances on the big nodes to handle the HBase side of 1), 2), and 3) above.
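
To make the cost reasoning concrete, here is a rough back-of-the-envelope 
sketch. All prices and node counts in it are made-up placeholders (not real 
EC2 pricing), and the names NodeType, monthly_cost and datanode_count are 
just for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    hourly_cost: float       # assumed USD per hour, placeholder value
    runs_regionserver: bool  # True for the big nodes only


def monthly_cost(layout, hours=730):
    """Total monthly cost of a layout given as [(NodeType, count), ...]."""
    return sum(node.hourly_cost * count for node, count in layout) * hours


def datanode_count(layout):
    """Every node runs a DN + TT, so this is simply the node total."""
    return sum(count for _, count in layout)


def regionserver_count(layout):
    """Only nodes flagged as running a RegionServer are counted."""
    return sum(count for node, count in layout if node.runs_regionserver)


# Hypothetical instance types and prices.
LARGE = NodeType("large", hourly_cost=0.40, runs_regionserver=True)
SMALL = NodeType("small", hourly_cost=0.10, runs_regionserver=False)

uniform = [(LARGE, 10)]             # RS + DN + TT on every node
hybrid = [(LARGE, 4), (SMALL, 12)]  # RS only on the 4 big nodes

for label, layout in (("uniform", uniform), ("hybrid", hybrid)):
    print(label,
          "nodes:", datanode_count(layout),
          "RSs:", regionserver_count(layout),
          "USD/month:", round(monthly_cost(layout), 2))
```

With these invented numbers the hybrid layout has more DN/TT nodes for less 
money, but far fewer RSs, which is exactly the trade-off I'm asking about.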


Is the above setup crazy?

Are there some obvious flaws that would really cause operational or performance 
pains?
Would such a cluster have major performance issues because of data that needs 
to be transferred between DNs that are on all nodes and RSs running only on the 
big nodes?


Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
