Hi Lars,

Yes, I need to retrieve all the values for a row at a time. That said, I did experiment with different batch sizes and it made no difference whatsoever. (Caching, on the other hand, did make some difference: ~2-3% faster with a larger cache.)

I see your point about scanners returning sorted KVs. In my application I simply don't care whether the results are sorted, and I know the key range in advance, so this is a great suggestion. Let me try replacing the single scan with a list of GETs, or with a bunch of SCANs over different start/stop rows, along the lines of the sketch below.
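Something like this is what I have in mind - a rough sketch against the 0.9x-era client API we run, where the table name, slice count, and caching value are placeholders, and where I slice the key space on the first byte since the 8-byte hash keys should be uniformly distributed:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelScan {

  private static final int SLICES = 16;  // placeholder; tune to cluster size

  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    ExecutorService pool = Executors.newFixedThreadPool(SLICES);
    List<Future<Long>> futures = new ArrayList<Future<Long>>();

    for (int i = 0; i < SLICES; i++) {
      // Slice the key space on the first byte of the 8-byte hash keys.
      final byte[] start = { (byte) (i * 256 / SLICES) };
      final byte[] stop = (i == SLICES - 1)
          ? new byte[0]  // empty stop row = scan to the end of the table
          : new byte[] { (byte) ((i + 1) * 256 / SLICES) };

      futures.add(pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
          // HTable is not thread-safe, so each worker opens its own.
          HTable table = new HTable(conf, Bytes.toBytes("mytable"));
          try {
            Scan scan = new Scan(start, stop);
            scan.setCaching(50);  // ~2MB rows -> ~100MB per RPC, as today
            ResultScanner scanner = table.getScanner(scan);
            long cells = 0;
            for (Result row : scanner) {  // no setBatch: one full row per Result
              cells += row.size();        // consume the row's values here
            }
            scanner.close();
            return cells;
          } finally {
            table.close();
          }
        }
      }));
    }

    long total = 0;
    for (Future<Long> f : futures) total += f.get();
    pool.shutdown();
    System.out.println("KVs read: " + total);
  }
}

Alternatively, since I know every row key up front, HTable.get(List<Get>) would fetch them as batched multi-gets; I'll try both and compare.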
Thanks!
Gurjeet

On Sun, Aug 12, 2012 at 3:24 PM, lars hofhansl <[email protected]> wrote:
> Do you really have to retrieve all 200,000 each time?
> Scan.setBatch(...) makes no difference?! (Note that batching is different
> and separate from caching.)
>
> Also note that the scanner contract is to return sorted KVs, so a single
> scan cannot be parallelized across RegionServers (well, not entirely true:
> it could be farmed out in parallel and then presented to the client in the
> right order, but HBase does not do that). That is why one vs. 12 RSs makes
> no difference in this scenario.
>
> In the 12-node case you'll see low CPU on all but one RS, and each RS will
> get its turn.
>
> In your case this is scanning 20,000,000 KVs serially in 400s; that's
> 50,000 KVs/s, which - depending on hardware - is not too bad for HBase
> (but not great either).
>
> If you only ever expect to run a single query like this on top of your
> cluster (i.e. your concern is latency, not throughput), you can do
> multiple RPCs in parallel, each for a sub-portion of your key range.
> Together with batching, you can start using the values before everything
> is streamed back from the server.
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <[email protected]>
> To: [email protected]
> Cc:
> Sent: Saturday, August 11, 2012 11:04 PM
> Subject: Slow full-table scans
>
> Hi,
>
> I am trying to read all the data out of an HBase table using a scan,
> and it is extremely slow.
>
> Here are some characteristics of the data:
>
> 1. The total table size is tiny (~200MB).
> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> Thus the size of each cell is ~10 bytes and the size of each row is
> ~2MB.
> 3. Currently, scanning the whole table takes ~400s (both in a
> distributed setting with 12 nodes or so and on a single node), thus
> ~5 sec/row.
> 4. The row keys are unique 8-byte crypto hashes of sequential numbers.
> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> and is set to fetch 100MB of data at a time (scan.setCaching).
> 6. Changing the caching size seems to have no effect on the total
> scan time at all.
> 7. The column family is set up to keep a single version of the cells,
> with no compression and no block cache.
>
> Am I missing something? Is there a way to optimize this?
>
> I guess a general question I have is whether HBase is a good datastore
> for storing many medium-sized (~50GB), dense datasets with lots of
> columns, when a lot of the queries require full table scans.
>
> Thanks!
> Gurjeet
