Do you really have to retrieve all 200,000 columns each time? Does Scan.setBatch(...) make no difference? (Note that batching is different from, and separate from, caching.)
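For reference, this is roughly how batching and caching are configured with the (pre-1.0) HBase Java client; a sketch only, and "mytable" is a placeholder table name:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BatchedScan {
    public static void main(String[] args) throws IOException {
        // "mytable" is a placeholder; requires a running HBase cluster.
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");
        Scan scan = new Scan();
        scan.setBatch(1000);  // cap KVs per Result, so a 200,000-column row
                              // is returned in chunks instead of all at once
        scan.setCaching(10);  // number of Results buffered per RPC round trip
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                // each Result here holds at most 1000 KVs of a single row
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}
```

Note that setCaching counts Results, not bytes, so with batching enabled it counts row chunks rather than whole rows.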
Also note that the scanner contract is to return sorted KVs, so a single scan cannot be parallelized across RegionServers. (Well, that's not entirely true: the work could be farmed out in parallel and the results presented to the client in the right order, but HBase does not do that.) That is why one vs. 12 RegionServers makes no difference in this scenario. In the 12-node case you'll see low CPU on all but one RS, and each RS will get its turn.

In your case this is scanning 20,000,000 KVs serially in 400s, i.e. 50,000 KVs/s, which, depending on hardware, is not too bad for HBase (but not great either).

If you only ever expect to run a single query like this on your cluster (i.e. your concern is latency, not throughput), you can issue multiple RPCs in parallel, each covering a sub-portion of your key range. Together with batching, you can start using the values before everything has been streamed back from the server.

-- Lars

----- Original Message -----
From: Gurjeet Singh <[email protected]>
To: [email protected]
Cc:
Sent: Saturday, August 11, 2012 11:04 PM
Subject: Slow full-table scans

Hi,

I am trying to read all the data out of an HBase table using a scan, and it is extremely slow. Here are some characteristics of the data:

1. The total table size is tiny (~200MB).
2. The table has ~100 rows and ~200,000 columns in a SINGLE family. Thus the size of each cell is ~10 bytes and the size of each row is ~2MB.
3. Currently scanning the whole table takes ~400s (both in a distributed setting with 12 nodes or so and on a single node), i.e. 5 sec/row.
4. The row keys are unique 8-byte crypto hashes of sequential numbers.
5. The scanner is set to fetch a FULL row at a time (scan.setBatch) and is set to fetch 100MB of data at a time (scan.setCaching).
6. Changing the caching size seems to have no effect on the total scan time at all.
7. The column family is set up to keep a single version of the cells, with no compression and no block cache.

Am I missing something? Is there a way to optimize this?
I guess a general question I have is whether HBase is a good datastore for storing many medium-sized (~50GB), dense datasets with lots of columns, when a lot of the queries require full table scans?

Thanks!
Gurjeet
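Following up on Lars's parallel-RPC suggestion: since the row keys here are 8-byte crypto hashes (effectively uniform over the key space), the key space can be split evenly into contiguous sub-ranges and one scan issued per range in parallel. A sketch of just the range-splitting arithmetic (the per-range Scan with setStartRow/setStopRow and the thread pool are omitted; class and method names are made up for illustration):

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {
    // Split the unsigned 64-bit key space into n contiguous [start, stop) pairs.
    // Each pair can back one Scan (setStartRow/setStopRow) run in its own thread.
    public static List<byte[][]> split(int n) {
        BigInteger max = BigInteger.ONE.shiftLeft(64);  // 2^64
        List<byte[][]> ranges = new ArrayList<byte[][]>();
        for (int i = 0; i < n; i++) {
            BigInteger start = max.multiply(BigInteger.valueOf(i))
                                  .divide(BigInteger.valueOf(n));
            BigInteger stop = max.multiply(BigInteger.valueOf(i + 1))
                                 .divide(BigInteger.valueOf(n));
            ranges.add(new byte[][] { toKey(start), toKey(stop) });
        }
        return ranges;
    }

    // Encode the low 64 bits of a value as a big-endian 8-byte row key.
    // Note the final stop key (2^64) wraps to all zeros; in a real scan the
    // last range should use an empty stop row to mean "end of table".
    static byte[] toKey(BigInteger v) {
        byte[] key = new byte[8];
        for (int i = 7; i >= 0; i--) {
            key[i] = v.byteValue();
            v = v.shiftRight(8);
        }
        return key;
    }
}
```

With 12 RegionServers and regions spread evenly, splitting into a dozen or so ranges lets every RS do work at once instead of taking turns, which is exactly what the serial scan above cannot do.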
