Hi J-D,

We have 8 drives per machine (~500 GB per drive, ~4 TB total).
The metrics from my run indicate that for writes I achieve around
1 row (5 KB) in 2 ms => 500 rows (5 KB each) in 1 sec => 2.5 MB/sec,

and from your observation at StumbleUpon:
200k rows/sec (presuming 100 bytes per row) => 20 MB/sec.

Wow!! That is nearly an order of magnitude of difference. I am sure
disabling the WAL during the writes is giving you a significant boost.

Are you reading the data at the same time as you are writing?

Thx
Jacob

On Fri, May 28, 2010 at 9:04 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> What I wanted out of this discussion was to find out whether I am in the
>> ballpark of what I can juice out of HBase or I am way off the mark.
>>
>
> I understand... but this is a distributed system we're talking about.
> Unless I have the same code, hbase/hadoop version, configuration,
> number of nodes, cpu, RAM, # of HDDs, OS, network equipment, data set,
> etc... it's really hard to assess right? For starters, I don't think
> you specified the number of drives you have per machine, and HBase is
> mostly IO-bound.
>
> FWIW, here's our experience. At StumbleUpon, we uploaded our main data
> set consisting of 13B*2 rows on 20 machines (2xi7, 24GB (8 for HBase),
> 4x 1TB JBOD) with MapReduce (using 8 maps per machine) pulling from a
> MySQL cluster (we were selecting large ranges in batches), inserting
> at an average rate of 150-200k rows per second, peaks at 1M. Our rows
> are a few bytes, mostly integers and some text. We did it in the time
> with HBase 0.20.3 + the parallel-put patch we wrote here (available in
> trunk) with the configuration I pasted previously. For that upload the
> WAL was disabled and ALL our tables are LZOed (can't stress enough the
> importance of compressing your tables!) and 1GB max file size.
>
> My guess is yes you can juice it out more, first by using LZO ;)
>
> Also, are your machines even stressed during the test? Do you monitor?
> Could you increase the number of clients?
>
> Sorry I can't give you a very clear answer, but without using a common
> benchmark to compare numbers we're pretty much all in the dark. YCSB
> is one, but IIRC it needs some patches to work efficiently (Todd
> Lipcon from Cloudera has them in his github).
>
> J-D
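
For reference, the back-of-the-envelope throughput comparison in the reply at the top of this message can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not a benchmark: the function and variable names are made up for illustration, 5 KB is taken as 5,000 bytes, 1 MB as 1,000,000 bytes, and the 100 bytes per row for the StumbleUpon figure is the presumption stated above, not a measured value.

```python
def throughput_mb_per_sec(rows_per_sec, bytes_per_row):
    """Write throughput in MB/s, assuming 1 MB = 1,000,000 bytes."""
    return rows_per_sec * bytes_per_row / 1_000_000


# My run: one 5 KB row every 2 ms (5 KB assumed to be 5,000 bytes).
my_rows_per_sec = 1000 / 2          # milliseconds per second / ms per row
my_mb_per_sec = throughput_mb_per_sec(my_rows_per_sec, 5_000)

# StumbleUpon figure: 200k rows/sec, presuming 100 bytes per row.
su_mb_per_sec = throughput_mb_per_sec(200_000, 100)

# How many times faster the StumbleUpon upload is in raw bytes written.
ratio = su_mb_per_sec / my_mb_per_sec

print(f"my run:      {my_mb_per_sec} MB/s")
print(f"StumbleUpon: {su_mb_per_sec} MB/s")
print(f"ratio:       {ratio}x")
```

In bytes per second the gap is about 8x, while in rows per second it is 400x (200,000 vs 500), so the comparison depends heavily on the presumed row size.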