> What I wanted out of this discussion was to find out whether I am in
> the ballpark of what I can juice out of HBase or I am way off the mark.
I understand... but this is a distributed system we're talking about. Unless I have the same code, hbase/hadoop version, configuration, number of nodes, CPU, RAM, # of HDDs, OS, network equipment, data set, etc., it's really hard to assess, right? For starters, I don't think you specified the number of drives you have per machine, and HBase is mostly IO-bound.

FWIW, here's our experience. At StumbleUpon, we uploaded our main data set consisting of 13B*2 rows on 20 machines (2x i7, 24GB RAM (8 for HBase), 4x 1TB JBOD) with MapReduce (using 8 maps per machine) pulling from a MySQL cluster (we were selecting large ranges in batches), inserting at an average rate of 150-200k rows per second, with peaks at 1M. Our rows are a few bytes, mostly integers and some text. We did it at the time with HBase 0.20.3 plus the parallel-put patch we wrote here (available in trunk), with the configuration I pasted previously. For that upload the WAL was disabled, ALL our tables are LZOed (can't stress enough the importance of compressing your tables!), and the max file size was 1GB.

My guess is yes, you can juice more out of it, first by using LZO ;) Also, are your machines even stressed during the test? Do you monitor them? Could you increase the number of clients?

Sorry I can't give you a very clear answer, but without using a common benchmark to compare numbers we're pretty much all in the dark. YCSB is one, but IIRC it needs some patches to work efficiently (Todd Lipcon from Cloudera has them in his github).
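Just to make those knobs concrete, here's roughly what the fast write path looks like from the Java client (a minimal sketch against the 0.20/0.90-era API; the table name "mytable", family "f", and the buffer size are placeholders, not our actual schema):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BulkWriter {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");

        // Buffer puts client-side instead of doing one RPC per row.
        table.setAutoFlush(false);
        table.setWriteBufferSize(12L * 1024 * 1024);  // 12MB, tune to your row size

        for (long i = 0; i < 1000000; i++) {
          Put put = new Put(Bytes.toBytes(i));
          // Skip the write-ahead log: much faster, but a crashed region
          // server loses whatever wasn't flushed yet.
          put.setWriteToWAL(false);
          put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
          table.put(put);
        }
        table.flushCommits();  // push out whatever is left in the write buffer
      }
    }

Only disable the WAL for uploads you can replay from the source, like we could from MySQL.

Enabling LZO and the 1GB max file size at table-creation time looks something like this (same caveats: a sketch with placeholder names, and LZO assumes the hadoop-lzo native libraries are installed on every node):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateLzoTable {
      public static void main(String[] args) throws IOException {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

        HTableDescriptor table = new HTableDescriptor("mytable");
        table.setMaxFileSize(1024L * 1024 * 1024);  // 1GB max region file size

        HColumnDescriptor family = new HColumnDescriptor("f");
        family.setCompressionType(Compression.Algorithm.LZO);
        table.addFamily(family);

        admin.createTable(table);
      }
    }

J-D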