Hi all,

We are about to set up a new installation using CDH3 beta 3 and the following machines:
- 10 nodes of single quad core, 8GB memory, 2x500GB SATA
- 3 nodes of dual quad core, 24GB memory, 6x250GB SATA

We are finding our feet, and will blog tests, metrics etc. as we go, but our initial usage patterns will be:

- initial load of 250 million records to HBase
- data harvesters pushing 300-600 records per second of insert or update (under 1KB per record) to TABLE_1 in HBase
- MR job processing changed content in TABLE_1 into TABLE_2 on an (e.g.) 6 hourly cron job (potentially using co-processors in the future)
- MR job processing changed content in TABLE_2 into TABLE_3 on an (e.g.) 6 hourly cron job (potentially using co-processors in the future)
- MR jobs building Lucene, Solr, PostGIS (Hive + Sqoop) indexes on a 6, 12 or 24 hourly cron job, either by
  a) bulk export from HBase to .txt and then Hive or custom MR processing
  b) Hive or custom MR processing straight from HBase tables as the input format
- MR jobs building analytical counts (e.g. 4 way "group bys" in SQL using Hive) on a 6, 12, 4 hourly cron, either by
  a) bulk export from HBase to .txt and then Hive / custom MR processing
  b) Hive or custom MR processing straight from HBase tables

To give an idea, at the moment on the 10 node cluster Hive against .txt files does a full scan in 3-4 minutes (our live system is MySQL and we export to .txt for Hive).

I see we have 2 options, but I am inexperienced and seek any guidance:

a) run HDFS across all 13 nodes, MR on the 10 small nodes, region servers on the 3 big nodes
   - MR will never benefit from data locality when using HBase (? I think)
b) run 2 completely separate clusters
   clu1: 10 nodes, HDFS, MR
   clu2: 3 nodes, HDFS, MR, RegionServer

With option b) we would do 6 hourly exports from clu2 -> clu1 and really keep the processing load off the HBase cluster.

We are prepared to run both, benchmark and provide metrics, but I wonder if someone has some advice beforehand.
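
To put rough numbers on the figures above, here is a back-of-envelope sketch. It is only an estimate under stated assumptions: 1KB is used as an upper bound per record, HDFS replication is assumed to be 3 (not something we have decided), units are decimal, and HBase storage overhead (keys, column families, compactions) is ignored entirely. The function names are just illustrative.

```python
# Back-of-envelope sizing from the figures in this mail.
# Assumptions (not measurements): 1 KB per record as an upper bound,
# HDFS replication factor of 3, decimal units (1 TB = 1000 GB),
# and no allowance for HBase storage overhead.

RECORD_KB = 1.0      # upper bound: records are "under 1KB"
REPLICATION = 3      # assumed HDFS replication factor


def daily_records(rate_per_sec):
    """Records written per day at a steady insert/update rate."""
    return rate_per_sec * 86_400


def raw_disk_tb(nodes, disks_per_node, disk_gb):
    """Total raw disk across a group of identical nodes, in TB."""
    return nodes * disks_per_node * disk_gb / 1000.0


def stored_gb(records, record_kb=RECORD_KB, replication=REPLICATION):
    """Upper-bound on-disk size in GB for a number of records, after replication."""
    return records * record_kb * replication / 1_000_000.0


if __name__ == "__main__":
    print("records/day at 300/s:", daily_records(300))
    print("records/day at 600/s:", daily_records(600))
    print("raw disk, 10 small nodes (TB):", raw_disk_tb(10, 2, 500))
    print("raw disk, 3 big nodes (TB):", raw_disk_tb(3, 6, 250))
    print("initial load, replicated (GB):", stored_gb(250_000_000))
```

Under those assumptions the harvesters add roughly 26 to 52 million record writes per day, the initial load is at most about 750GB after replication, and the raw disk is about 10TB on the small nodes versus 4.5TB on the big ones.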

We are anticipating:

- NN, 2nd NN, JT on 3 of the 10 smaller nodes
- HBase master on 1 of the 3 big nodes
- 1 ZK daemon on 1 of the 3 big nodes (or should we go for an ensemble of 3, with one on each?)

Thanks for any help anyone can provide,
Tim (- and Lars F.)
