Hi Tatsuya,

On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano <[email protected]> wrote:
> Hello,
>
> I remember Jon was talking the other day that he was trying a single HBase
> server with existing HDFS cluster to serve map reduce (MR) results. I wonder
> if this went well or not.
>
> A couple of friends in Tokyo are considering HBase to do a similar thing.
> They want to serve MR results inside the clients' companies via HBase. They
> both have existing MR/HDFS environment; one has a small (< 10) and another
> has a large (> 50) clusters.
>
> They'll use the incremental loading to existing table (HBASE-1923) to add
> the MR results to the HBase table, and only few users will read and export
> (web CSV download) the results via HBase. So HBase will be lightly loaded.
> They probably won't even need high availability (HA) option on HBase.
>
> So I'm thinking to recommend them to add just one server (non-HA) or two
> servers (HA) to their Hadoop cluster, and run only HMaster and Region Server
> processes on the server(s). The HBase cluster will utilize the existing
> (small or large) HDFS cluster and ZooKeeper ensemble.
>

If your "exported dataset" from the MR job is small enough to fit on one
server, you can certainly use a single HBase RS plus the bulk load
functionality. However, with a small dataset like that it might make more
sense to simply export TSV/CSV and then use a tool like Sqoop to export to a
relational database. That way you'd have better off-the-shelf integration
with various other tools or access methods.

> The server spec will be 2 x 8-core processors and 8GB to 24GB RAM. The RAM
> size will change depending on the data volume and access pattern.
>
> Has anybody tried a similar configuration? and how it goes?
>
>
> Also, I saw Jon's slides for Hadoop World in NYC 2009, and it was said that
> I'd better to have at least 5 Region Servers / Data Nodes in my cluster to
> get the typical performance. If I deploy RS and DN on separate servers,
> which one should be >= 5 nodes? DN? RS? or both?
>

Better to colocate the DNs and RSs for most deployments. You get
significantly better random read performance for uncached data.

-Todd

>
> Thanks,
> Tatsuya Kawano
> Tokyo, Japan
>

-- 
Todd Lipcon
Software Engineer, Cloudera
