Thanks, I will use these results as a baseline and see what I can do to tweak them.
-Pete

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
Sent: Monday, December 06, 2010 5:01 PM
To: [email protected]
Subject: Re: Make it quicker

The speed really depends on the size of the rows, which is all the values
plus all the keys (row, family, qualifier, timestamp) for each of those
values. For example, if your rows total 500 bytes each, you have to pull
about 300MB (627k rows x 500 bytes), which over your 9 seconds works out
to about 33MB/s. That's good considering you're going through the network
for non-local data and that it takes multiple RPCs to fetch all that
data... but that's just an example.

Usual optimizations:
- use scanner caching (see the sketch below)
  http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCaching(int)
- use LZO compression
- only retrieve the columns you need
- use the smallest keys possible

Hope that helps,

J-D

On Mon, Dec 6, 2010 at 2:02 PM, Peter Haidinyak <[email protected]> wrote:
> Hi y'all,
>   Ok, I put about 2.5 million rows into HBase running on three machines
> (2 region servers and 1 name node, etc.). The row id is the date plus a
> number that increments ('20101201|0000001'). From a Java client I do a
> scan with the starting row and ending row for one day's logs (the last
> 627k rows in HBase).
>   Right now the scan takes about 9 seconds to process those 627k rows.
> Is that about normal for commodity servers? Also, where can I learn how
> to optimize this process?
>
> Thanks again.
>
> -Pete
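For reference, a minimal sketch of the scanner setup J-D describes, using
the 0.89-era client API. The table name ("logs"), column family ("cf"),
qualifier ("msg"), and caching value of 1000 are placeholders, not anything
from the thread; the start/stop rows follow Pete's '20101201|0000001' key
format (the stop row is exclusive).

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DailyScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "logs");

    // Scan one day's rows; the stop row is exclusive, so this covers
    // every key starting with '20101201|'.
    Scan scan = new Scan(Bytes.toBytes("20101201|0000000"),
                         Bytes.toBytes("20101202|0000000"));

    // Fetch 1000 rows per RPC instead of the default 1; this is usually
    // the biggest win for large sequential scans.
    scan.setCaching(1000);

    // Only retrieve the columns you actually need.
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("msg"));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r here
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}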
