Hi all,

We are about to set up a new installation using the following
machines and CDH3 beta 3:

- 10 nodes of single quad core, 8GB memory, 2x500GB SATA
- 3 nodes of dual quad core, 24GB memory, 6x250GB SATA

We are finding our feet and will blog tests, metrics etc. as we go,
but our initial usage patterns will be:

- initial load of 250 million records to HBase
- data harvesters pushing 300-600 inserts or updates per second
(under 1KB per record) into TABLE_1 in HBase (see the write-path
sketch after this list)
- MR job processing changed content in TABLE_1 into TABLE_2 on an
(e.g.) 6 hourly cron job (potentially using co-processors in the
future; see the MR sketch after this list)
- MR job processing changed content in TABLE_2 into TABLE_3 on an
(e.g.) 6 hourly cron job (potentially using co-processors in the
future)
- MR jobs building Lucene, SOLR and PostGIS (Hive + Sqoop) indexes on
a 6, 12 or 24 hourly cron job, either by
  a) bulk export from HBase to .txt and then Hive or custom MR processing
  b) Hive or custom MR processing straight from HBase tables as the input format
- MR jobs building analytical counts (e.g. 4-way "group bys" in SQL
using Hive) on a 6, 12 or 24 hourly cron, either by
  a) bulk export from HBase to .txt and then Hive / custom MR processing
  b) Hive / custom MR processing straight from HBase tables
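
For the harvester write path we are thinking of something along these
lines: buffer puts on the client so 300-600 small writes per second
reach the region servers in batches rather than one RPC each. This is
only a rough sketch against the 0.89/0.90-era client API that ships
with CDH3b3; the row key, column family and qualifier below are made
up:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HarvesterWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "TABLE_1");
      // Buffer writes client-side so small records travel in batches,
      // not one RPC per put.
      table.setAutoFlush(false);
      table.setWriteBufferSize(2 * 1024 * 1024); // 2MB, to be tuned

      Put put = new Put(Bytes.toBytes("record-00000001"));  // made-up row key
      put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), // made-up family/qualifier
          Bytes.toBytes("...record body, under 1KB..."));
      table.put(put);

      table.flushCommits(); // also called periodically / on shutdown
      table.close();
    }
  }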
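
For the TABLE_1 -> TABLE_2 style jobs (and for option b) of the index
and group-by jobs, i.e. reading straight from HBase), we are thinking
of TableMapReduceUtil with a time-range scan so each run only touches
rows changed since the last one. Again just a sketch assuming the
0.89/0.90 API; the column names are made up and the mapper body would
be replaced by the real transformation:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.mapreduce.Job;

  public class Table1ToTable2 {

    // Reads recently changed rows from TABLE_1 and emits Puts for TABLE_2.
    static class ChangeMapper extends TableMapper<ImmutableBytesWritable, Put> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        byte[] payload = value.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
        if (payload != null) {
          Put put = new Put(row.get());
          put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), payload); // real transform here
          context.write(row, put);
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "TABLE_1 -> TABLE_2");
      job.setJarByClass(Table1ToTable2.class);

      Scan scan = new Scan();
      scan.setCaching(500);        // bigger scanner batches for MR
      scan.setCacheBlocks(false);  // don't churn the block cache with a big scan
      // Only look at rows written in the last 6 hours (matches the cron interval):
      scan.setTimeRange(System.currentTimeMillis() - 6L * 3600 * 1000, Long.MAX_VALUE);

      TableMapReduceUtil.initTableMapperJob("TABLE_1", scan, ChangeMapper.class,
          ImmutableBytesWritable.class, Put.class, job);
      TableMapReduceUtil.initTableReducerJob("TABLE_2", null, job); // TableOutputFormat -> TABLE_2
      job.setNumReduceTasks(0); // map-only, Puts go straight to TABLE_2

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The same pattern (an HBase table as the job's input format) is what we
would try for the Lucene/SOLR/PostGIS index builds and the Hive-style
counts if we skip the bulk export to .txt.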

To give an idea: at the moment, on the 10 node cluster, Hive against
.txt files does a full scan in 3-4 minutes (our live system is MySQL
and we export to .txt for Hive).

I see we have two options, but I am inexperienced and would welcome
any guidance:

a) run HDFS across all 13 nodes, MR on the 10 small nodes, region
servers on the 3 big nodes
  - MR will never benefit from data locality when using HBase (I think?)
b) run 2 completely separate clusters
  clu1: 10 nodes, HDFS, MR
  clu2: 3 nodes, HDFS, MR, RegionServer

With option b) we would do 6-hourly exports from clu2 -> clu1 and
really keep the processing load off the HBase cluster.

We are prepared to run both, benchmark and provide metrics, but I
wonder if someone has some advice beforehand.

We are anticipating:
- NN, 2nd NN, JT on 3 of the 10 smaller nodes
- HBase master on 1 of the 3 big nodes
- 1 ZK daemon on 1 of the 3 big nodes (or should we go for an
ensemble of 3, with one on each? see the client config snippet below)
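
On the ZK point: whichever way we go, on the client side it should
just be the quorum list in the config. A trivial sketch, with
hypothetical hostnames for the big nodes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class ClientConf {
    public static Configuration create() {
      Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
      // With an ensemble of 3 (one ZK daemon per big node), clients list all three:
      conf.set("hbase.zookeeper.quorum", "big1,big2,big3"); // hypothetical hostnames
      conf.set("hbase.zookeeper.property.clientPort", "2181");
      return conf;
    }
  }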

Thanks for any help anyone can provide,

Tim
(- and Lars F.)
