Thanks, I will use these results as a baseline and see what I can do to tweak 
them.

-Pete


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel 
Cryans
Sent: Monday, December 06, 2010 5:01 PM
To: [email protected]
Subject: Re: Make it quicker

The speed really depends on the size of your rows, which is all the
values plus a full key (row, family, qualifier, timestamp) for each of
those values. For example, if your rows are about 500 bytes each, then
627k rows is roughly 300MB, and 300MB in 9 seconds is a throughput of
about 33MB/s. That's good considering you're going through the network
for non-local data and that it takes multiple RPCs to fetch all that
data... but that's just an example.

Usual optimizations:

 - use scanner caching
http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCaching(int)
 - use LZO
 - only retrieve the columns you need
 - use the smallest keys possible
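
For example, here's a minimal sketch in Java applying scanner caching
and column restriction to your day scan (the table name "logs" and the
column "d:msg" are placeholders, adjust to your schema):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DayScan {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "logs");

    // Scan one day's rows; the stop row is exclusive, so the next
    // day's prefix covers everything under 20101201.
    Scan scan = new Scan(Bytes.toBytes("20101201"),
                         Bytes.toBytes("20101202"));
    scan.setCaching(1000);  // fetch 1000 rows per RPC instead of the default 1
    scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"));  // only what you need

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process result
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

Scanner caching trades client memory for fewer round trips: with the
default caching of 1 a 627k-row scan means 627k RPCs, while a caching
of 1000 cuts that to under a thousand.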

Hope that helps,

J-D

On Mon, Dec 6, 2010 at 2:02 PM, Peter Haidinyak <[email protected]> wrote:
> Hi y'all,
>  OK, I put about 2.5 million rows into an HBase cluster running on three
> machines (2 region servers and 1 name node, etc). The row id is the date
> plus an incrementing number ('20101201|0000001'). From a Java client I do
> a scan with the starting row and ending row for one day's logs (the last
> 627k rows in HBase).
>  Right now the scan runs in about 9 seconds to process 627k rows. For
> commodity servers, is that about normal? Also, where can I learn how to
> optimize this process?
>
> Thanks again.
>
> -Pete
>
