Do you really have to retrieve all 200,000 columns each time?
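And Scan.setBatch(...) makes no difference? (Note that batching is separate
from caching: batching caps the number of columns returned per Result, while
caching sets the number of rows fetched per RPC.)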

Also note that the scanner contract is to return sorted KVs, so a single scan 
cannot be parallelized across RegionServers (well not entirely true, it could 
be farmed off in parallel and then be presented to the client in the right 
order - but HBase is not doing that). That is why one vs 12 RSs makes no 
difference in this scenario.

In the 12-node case you'll see low CPU on all but one RS at any given time,
and each RS will get its turn.

In your case this is scanning 20,000,000 KVs serially in 400s; that's 50,000
KVs/s, which - depending on hardware - is not too bad for HBase (but not great
either).
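
For reference, here is that arithmetic spelled out as a small Java sketch. The
figures are the approximate ones from the original post (~100 rows, ~200,000
columns per row, ~400s); the class and method names are made up for
illustration.

```java
public class ScanThroughput {

    /** KeyValues scanned per second for a table of rows x columnsPerRow cells. */
    static long kvsPerSecond(long rows, long columnsPerRow, long seconds) {
        return (rows * columnsPerRow) / seconds;
    }

    public static void main(String[] args) {
        long rows = 100;              // ~100 rows (from the original post)
        long columnsPerRow = 200_000; // ~200,000 columns per row
        long seconds = 400;           // ~400s for the full scan

        System.out.println(rows * columnsPerRow);                       // 20000000
        System.out.println(kvsPerSecond(rows, columnsPerRow, seconds)); // 50000
    }
}
```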

If you only ever expect to run a single query like this on top of your cluster
(i.e. your concern is latency, not throughput), you can issue multiple RPCs in
parallel, each covering a sub-portion of your key range. Together with
batching, you can start using values before everything has been streamed back
from the server.
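
A rough sketch of the parallel sub-range idea in plain Java follows. It assumes
the row keys are uniformly distributed 8-byte values (they are crypto hashes in
your case), so the key space can simply be cut into equal slices. RangeScanner
is a hypothetical stand-in for whatever code runs one HBase scan between a
start and stop row; it is not an HBase API. The results come back per
sub-range, so the client has to merge them itself if it needs global order.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeScan {

    /** Hypothetical: runs one scan over [startRow, stopRow), returns KVs seen. */
    interface RangeScanner {
        long scan(byte[] startRow, byte[] stopRow) throws Exception;
    }

    /**
     * Returns parts + 1 boundaries that cut the 8-byte key space into equal
     * slices. The last boundary is an empty array, which HBase treats as
     * "scan to the end of the table" when used as a stop row.
     */
    static List<byte[]> splitKeySpace(int parts) {
        BigInteger space = BigInteger.ONE.shiftLeft(64); // 2^64 possible keys
        List<byte[]> bounds = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            BigInteger b = space.multiply(BigInteger.valueOf(i))
                                .divide(BigInteger.valueOf(parts));
            bounds.add(toKey(b));
        }
        bounds.add(new byte[0]);
        return bounds;
    }

    /** Encodes a value in [0, 2^64) as a big-endian, zero-padded 8-byte key. */
    static byte[] toKey(BigInteger v) {
        byte[] raw = v.toByteArray(); // may be shorter than 8 or carry a sign byte
        byte[] key = new byte[8];
        int n = Math.min(raw.length, 8);
        System.arraycopy(raw, raw.length - n, key, 8 - n, n);
        return key;
    }

    /** Runs one scan per slice in parallel and sums the per-slice counts. */
    static long scanAll(RangeScanner scanner, int parts) {
        List<byte[]> bounds = splitKeySpace(parts);
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < parts; i++) {
                final byte[] start = bounds.get(i);
                final byte[] stop = bounds.get(i + 1);
                Callable<Long> task = () -> scanner.scan(start, stop);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // waits for that slice to finish
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Dummy scanner: pretends each slice contained exactly one KV.
        long total = scanAll((start, stop) -> 1L, 4);
        System.out.println(total);
    }
}
```

In a real client the RangeScanner body would open a scanner with the given
start and stop rows and hand each Result to the consumer as it arrives. How
many slices to use is a tuning question; one per region is a reasonable first
guess, not a rule.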


-- Lars



----- Original Message -----
From: Gurjeet Singh <[email protected]>
To: [email protected]
Cc: 
Sent: Saturday, August 11, 2012 11:04 PM
Subject: Slow full-table scans

Hi,

I am trying to read all the data out of an HBase table using a scan
and it is extremely slow.

Here are some characteristics of the data:

1. The total table size is tiny (~200MB)
2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
Thus the size of each cell is ~10 bytes and the size of each row is
~2MB
3. Currently scanning the whole table takes ~400s (both in a
distributed setting with 12 nodes or so and on a single node), thus
5sec/row
4. The row keys are unique 8 byte crypto hashes of sequential numbers
5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
and is set to fetch 100MB of data at a time (scan.setCaching)
6. Changing the caching size seems to have no effect on the total scan
time at all
7. The column family is set up to keep a single version of the cells,
no compression, and no block cache.

Am I missing something? Is there a way to optimize this?

I guess a general question I have is whether HBase is a good datastore
for storing many medium-sized (~50GB), dense datasets with lots of
columns when a lot of the queries require full table scans?

Thanks!
Gurjeet
