Thanks Lars!
One final question: is it advisable to run multiple threads
against a single HTable instance, like so:
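HTable table = ...
for (int i = 0; i < 10; i++) {
    // ScanThread is a Runnable, so it has to be wrapped in a Thread
    new Thread(new ScanThread(table, startRow, endRow, rowProcessor)).start();
}
....
class ScanThread implements Runnable {
    // fields and constructor omitted
    public void run() {
        Scan scan = new Scan();
        scan.setStartRow(startRow);
        scan.setStopRow(endRow);
        try {
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    rowProcessor.process(result);
                }
            } finally {
                scanner.close(); // release the server-side scanner
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}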
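
To make the pattern concrete without a cluster, here is a rough, self-contained sketch of what I have in mind, with HBase stubbed out: the known key range is cut into sub-ranges and each sub-range is handed to a worker in a thread pool. RangeScanner is a made-up stand-in for whatever performs the real scan (it is not an HBase class), and the integer keys and the split() helper are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeScan {
    /** Hypothetical stand-in for a real scan over [start, stop); not an HBase API. */
    interface RangeScanner {
        long scan(int start, int stop);
    }

    /** Splits [start, stop) into at most `parts` contiguous, non-overlapping sub-ranges. */
    static List<int[]> split(int start, int stop, int parts) {
        List<int[]> ranges = new ArrayList<>();
        int size = (stop - start + parts - 1) / parts; // ceiling division
        for (int lo = start; lo < stop; lo += size) {
            ranges.add(new int[] {lo, Math.min(lo + size, stop)});
        }
        return ranges;
    }

    /** Runs one scan per sub-range on a fixed-size pool and sums the per-range results. */
    static long scanAll(List<int[]> ranges, RangeScanner scanner, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int[] r : ranges) {
                Callable<Long> task = () -> scanner.scan(r[0], r[1]);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // blocks until that sub-range is done
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real thing each task would build its own Scan with the sub-range's start and stop rows; whether the tasks can safely share one HTable is exactly what I am asking.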
On Sun, Aug 12, 2012 at 4:00 PM, lars hofhansl <[email protected]> wrote:
> You can use HTable.{getStartEndKeys|getEndKeys|getStartKeys} to get the
> current region demarcations for your table.
> If you wanted to group threads by RegionServer (which you should) you get
> that information via HTable.getRegionLocation{s}
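
If I understand this correctly, the region boundaries could be turned into one scan range per region roughly as follows. This is only a sketch under assumptions: the two byte[][] arrays stand in for what HTable.getStartEndKeys hands back, the requested start and stop keys are both non-empty, and RegionRanges, compare and rangesFor are names I made up. Keys are compared as unsigned bytes, which is how HBase orders row keys.

```java
import java.util.ArrayList;
import java.util.List;

public class RegionRanges {
    /** Lexicographic comparison of two keys, treating each byte as unsigned. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /**
     * Clips each region [startKeys[i], endKeys[i]) to the requested [start, stop)
     * and returns the non-empty pieces as {start, stop} pairs.
     * An empty region end key means "to the end of the table".
     * Assumes start and stop are both non-empty.
     */
    static List<byte[][]> rangesFor(byte[][] startKeys, byte[][] endKeys,
                                    byte[] start, byte[] stop) {
        List<byte[][]> out = new ArrayList<>();
        for (int i = 0; i < startKeys.length; i++) {
            byte[] lo = compare(startKeys[i], start) > 0 ? startKeys[i] : start;
            boolean openEnd = endKeys[i].length == 0;
            byte[] hi = (openEnd || compare(endKeys[i], stop) > 0) ? stop : endKeys[i];
            if (compare(lo, hi) < 0) {
                out.add(new byte[][] {lo, hi}); // this region overlaps the request
            }
        }
        return out;
    }
}
```

Each returned pair would then become the start and stop row of one scan, and the pairs could be grouped by RegionServer before being handed to threads.
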
>
>
> -- Lars
>
>
> ----- Original Message -----
> From: Gurjeet Singh <[email protected]>
> To: [email protected]; lars hofhansl <[email protected]>
> Cc:
> Sent: Sunday, August 12, 2012 3:51 PM
> Subject: Re: Slow full-table scans
>
> Hi Lars,
>
> Yes, I need to retrieve all the values for a row at a time. That said,
> I did experiment with different batch sizes and that made no
> difference whatsoever. (Caching, on the other hand, did make some
> difference: ~2-3% faster with a larger cache.)
>
> I see your point about scanners returning sorted KVs. In my
> application, I simply don't care whether the results are sorted or not
> and I know the key range in advance. This is a great suggestion. Let
> me try replacing a single scan with a list of GETs or a bunch of SCANs
> with different start/stop rows.
>
> Thanks!
> Gurjeet
>
> On Sun, Aug 12, 2012 at 3:24 PM, lars hofhansl <[email protected]> wrote:
>> Do you really have to retrieve all 200,000 columns each time?
>> Scan.setBatch(...) makes no difference?! (note that batching is different
>> and separate from caching).
>>
>> Also note that the scanner contract is to return sorted KVs, so a single
>> scan cannot be parallelized across RegionServers (well not entirely true, it
>> could be farmed off in parallel and then be presented to the client in the
>> right order - but HBase is not doing that). That is why one vs 12 RSs makes
>> no difference in this scenario.
>>
>> In the 12 node case you'll see low CPU on all but one RS, and each RS will
>> get its turn.
>>
>> In your case this is scanning 20,000,000 KVs serially in 400s, that's 50,000
>> KVs/s, which - depending on hardware - is not too bad for HBase (but not
>> great either).
>>
>> If you only ever expect to run a single query like this on top of your
>> cluster (i.e. your concern is latency, not throughput) you can do multiple
>> RPCs in parallel, each for a sub-portion of your key range. Together with
>> batching, you can start using values before everything is streamed back
>> from the server.
>>
>>
>> -- Lars
>>
>>
>>
>> ----- Original Message -----
>> From: Gurjeet Singh <[email protected]>
>> To: [email protected]
>> Cc:
>> Sent: Saturday, August 11, 2012 11:04 PM
>> Subject: Slow full-table scans
>>
>> Hi,
>>
>> I am trying to read all the data out of an HBase table using a scan
>> and it is extremely slow.
>>
>> Here are some characteristics of the data:
>>
>> 1. The total table size is tiny (~200MB)
>> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
>> Thus the size of each cell is ~10 bytes and the size of each row is
>> ~2MB
>> 3. Currently scanning the whole table takes ~400s (both in a
>> distributed setting with 12 nodes or so and on a single node), thus
>> ~4 sec/row
>> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
>> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
>> and is set to fetch 100MB of data at a time (scan.setCaching)
>> 6. Changing the caching size seems to have no effect on the total scan
>> time at all
>> 7. The column family is set up to keep a single version of the cells,
>> no compression, and no block cache.
>>
>> Am I missing something? Is there a way to optimize this?
>>
>> I guess a general question I have is whether HBase is a good datastore
>> for storing many medium-sized (~50GB), dense datasets with lots of
>> columns when a lot of the queries require full table scans?
>>
>> Thanks!
>> Gurjeet
>>
>