It seems like the client code just sits idle, waiting for data from the regionservers.
Gurjeet

On Sun, Aug 12, 2012 at 4:13 PM, Jacques <[email protected]> wrote:
> I think the first question is where the time is spent. Does your analysis
> show that all the time is spent on the regionservers, or is a portion of
> the bottleneck on the client side?
>
> Jacques
>
> On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq <[email protected]> wrote:
>
>> Methods getStartKey and getEndKey provided by HRegionInfo class can be
>> used for that purpose.
>> Also, please make sure that no HTable instance is left open once you are
>> done with reads.
>> Regards,
>> Mohammad Tariq
>>
>> On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <[email protected]> wrote:
>>
>> > Hi Mohammad,
>> >
>> > This is a great idea. Is there an API call to determine the start/end
>> > key for each region?
>> >
>> > Thanks,
>> > Gurjeet
>> >
>> > On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <[email protected]> wrote:
>> > > Hello experts,
>> > >
>> > > Would it be feasible to create a separate thread for each region? I
>> > > mean we can determine the start and end key of each region and issue
>> > > a scan for each region in parallel.
>> > >
>> > > Regards,
>> > > Mohammad Tariq
>> > >
>> > > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <[email protected]> wrote:
>> > >
>> > >> Do you really have to retrieve all 200.000 each time?
>> > >> Scan.setBatch(...) makes no difference?! (note that batching is
>> > >> different and separate from caching).
>> > >>
>> > >> Also note that the scanner contract is to return sorted KVs, so a
>> > >> single scan cannot be parallelized across RegionServers (well not
>> > >> entirely true, it could be farmed off in parallel and then be
>> > >> presented to the client in the right order - but HBase is not doing
>> > >> that). That is why one vs 12 RSs makes no difference in this scenario.
>> > >>
>> > >> In the 12 node case you'll see low CPU on all but one RS, and each RS
>> > >> will get its turn.
>> > >>
>> > >> In your case this is scanning 20.000.000 KVs serially in 400s, that's
>> > >> 50000 KVs/s, which - depending on hardware - is not too bad for HBase
>> > >> (but not great either).
>> > >>
>> > >> If you only ever expect to run a single query like this on top of your
>> > >> cluster (i.e. your concern is latency not throughput) you can do
>> > >> multiple RPCs in parallel, each for a sub-portion of your key range.
>> > >> Together with batching you can start using values before everything
>> > >> is streamed back from the server.
>> > >>
>> > >> -- Lars
>> > >>
>> > >> ----- Original Message -----
>> > >> From: Gurjeet Singh <[email protected]>
>> > >> To: [email protected]
>> > >> Cc:
>> > >> Sent: Saturday, August 11, 2012 11:04 PM
>> > >> Subject: Slow full-table scans
>> > >>
>> > >> Hi,
>> > >>
>> > >> I am trying to read all the data out of an HBase table using a scan
>> > >> and it is extremely slow.
>> > >>
>> > >> Here are some characteristics of the data:
>> > >>
>> > >> 1. The total table size is tiny (~200MB)
>> > >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
>> > >> Thus the size of each cell is ~10bytes and the size of each row is
>> > >> ~2MB
>> > >> 3. Currently scanning the whole table takes ~400s (both in a
>> > >> distributed setting with 12 nodes or so and on a single node), thus
>> > >> 5sec/row
>> > >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
>> > >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
>> > >> and is set to fetch 100MB of data at a time (scan.setCaching)
>> > >> 6. Changing the caching size seems to have no effect on the total scan
>> > >> time at all
>> > >> 7. The column family is set up to keep a single version of the cells,
>> > >> no compression, and no block cache.
>> > >>
>> > >> Am I missing something? Is there a way to optimize this?
>> > >>
>> > >> I guess a general question I have is whether HBase is a good datastore
>> > >> for storing many medium sized (~50GB), dense datasets with lots of
>> > >> columns when a lot of the queries require full table scans?
>> > >>
>> > >> Thanks!
>> > >> Gurjeet
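
The idea discussed in the thread (issue one scan per sub-range of the key space in parallel, then hand the results to the caller in key order) can be sketched roughly as below. This is only an illustration under stated assumptions: it is plain Python, not HBase client code, the sorted in-memory list `rows` stands in for the table, and the names `scan_range`, `parallel_scan`, and `boundaries` are hypothetical. The one HBase convention it borrows is that a range scan includes its start key and excludes its stop key.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def scan_range(rows, start, stop):
    """Stand-in for one range scan: start inclusive, stop exclusive.

    `rows` is a list of (key, value) pairs sorted by key. A bound of
    None means "open-ended" on that side. (Hypothetical helper, not an
    HBase API.)
    """
    keys = [k for k, _ in rows]
    lo = 0 if start is None else bisect_left(keys, start)
    hi = len(rows) if stop is None else bisect_left(keys, stop)
    return rows[lo:hi]


def parallel_scan(rows, boundaries, max_workers=4):
    """Scan contiguous sub-ranges concurrently and return rows in key order.

    `boundaries` are sorted split keys (assumed here to play the role of
    region start keys); they cut the key space into len(boundaries) + 1
    sub-ranges.
    """
    edges = [None] + list(boundaries) + [None]
    ranges = list(zip(edges[:-1], edges[1:]))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, so joining the
        # parts keeps the global sort order even if workers finish out of
        # order.
        parts = pool.map(lambda r: scan_range(rows, r[0], r[1]), ranges)
        return [row for part in parts for row in part]


if __name__ == "__main__":
    table = [(b"a", 1), (b"c", 2), (b"f", 3), (b"k", 4), (b"z", 5)]
    for key, value in parallel_scan(table, [b"d", b"k"]):
        print(key, value)
```

Against a real cluster the split keys would presumably come from the region boundaries (the getStartKey/getEndKey methods mentioned above), and each worker would run its own scan over its sub-range. Whether that helps depends on where the time actually goes, which is the question Jacques raises.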
