Re: Optimizing Multi Gets in hbase

Varun Sharma Mon, 18 Feb 2013 22:45:47 -0800

I am actually more concerned about multiple gets within a region. I think
if random rows within a region are accessed, it should always be one scan
instead of doing one scan per get (just like we do for the
BulkDeleteEndpoint). Wouldn't that always be faster?
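
To make the single-pass idea concrete, here is a minimal sketch in plain Java. It does not use HBase at all: the TreeMap is a hypothetical stand-in for a region's sorted rows, and the names SinglePassMultiGet and lookupInOnePass are made up for illustration. It assumes the requested keys are already sorted and distinct, and it walks the covered key range once instead of seeking once per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SinglePassMultiGet {
    // Hypothetical model: "region" holds rows in sorted key order,
    // "sortedKeys" are the requested row keys, sorted and distinct.
    static List<String> lookupInOnePass(TreeMap<String, String> region,
                                        List<String> sortedKeys) {
        List<String> results = new ArrayList<>();
        if (sortedKeys.isEmpty()) {
            return results;
        }
        // Bound the pass to [first requested key, last requested key].
        String first = sortedKeys.get(0);
        String last = sortedKeys.get(sortedKeys.size() - 1);
        int next = 0; // index of the next requested key still to be matched
        for (Map.Entry<String, String> row
                : region.subMap(first, true, last, true).entrySet()) {
            // Skip requested keys that sort before this row: they are absent.
            while (next < sortedKeys.size()
                    && sortedKeys.get(next).compareTo(row.getKey()) < 0) {
                next++;
            }
            if (next == sortedKeys.size()) {
                break; // every requested key has been handled
            }
            if (sortedKeys.get(next).equals(row.getKey())) {
                results.add(row.getValue());
                next++;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        TreeMap<String, String> region = new TreeMap<>();
        for (int i = 1; i <= 6; i++) {
            region.put("row" + i, "v" + i);
        }
        System.out.println(lookupInOnePass(region, Arrays.asList("row2", "row5")));
    }
}
```

Note that the pass visits every row between the first and last requested key, so whether it beats individual seeks depends on how densely the requested keys cover that range.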


On Mon, Feb 18, 2013 at 5:48 PM, lars hofhansl <[email protected]> wrote:

> As it happens, we ran some tests on this last week.
> It turns out that doing Gets in batches instead of a scan still gives you
> 1/3 of the performance.
>
> I.e., when you have a table with, say, 10m rows and scanning takes N
> seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is pretty
> impressive.
>
> Now, this is with all data in the cache!
> When the data is not in the cache and the Gets are random it is many
> orders of magnitude slower, as the Gets are sprayed all over the disk. In
> that case sorting the Gets and issuing scans would indeed be much more
> efficient.
>
>
> The Gets in a batch are already sorted on the client, but as N. says it is
> hard to determine when to turn many Gets into a Scan with filters
> automatically. Without statistics/histograms I'd even wager that it
> would be impossible to do.
> Imagine you issue 10000 random Gets, but your table has 10bn rows; in that
> case it is almost certain that the Gets are faster than a scan.
> Now imagine the Gets only cover a small key range. With statistics we could
> tell whether it would be beneficial to turn this into a scan.
>
> It's not that hard to add statistics to HBase. I would do it as part of
> compactions and record the histograms in some table.
>
>
> You can always do that yourself. If you suspect you are touching most rows
> in a table/region, just issue a scan with an appropriate filter (you may
> have to implement your own filter, though). Maybe we could add a version of
> RowFilter that matches against multiple keys.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Varun Sharma <[email protected]>
> To: [email protected]
> Sent: Monday, February 18, 2013 1:57 AM
> Subject: Optimizing Multi Gets in hbase
>
> Hi,
>
> I am trying to do batched get(s) on a cluster. Here is the code:
>
> List<Get> gets = ...
> // Prepare my gets with the rows i need
> myHTable.get(gets);
>
> I have two questions about the above scenario:
> i) Is this the optimal way to do this?
> ii) I have a feeling that if there are multiple gets in this case, on the
> same region, then each one of those will instantiate a separate scan over
> the region even though a single scan is sufficient. Am I mistaken here?
>
> Thanks
> Varun
>
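
As a back-of-the-envelope reading of the figures quoted above (all data in cache, 10m batched Gets taking about 3N where a full scan takes N), a batched Get costs roughly three times as much as a scanned row. The sketch below turns that into a rough break-even check. The class name, the method, and the use of 3 as an exact cost ratio are assumptions for illustration, and rowsInKeyRange is precisely the number that HBase would need statistics or histograms to supply.

```java
class GetVsScanEstimate {
    // Assumption: one batched Get costs about 3x one scanned row (cached data).
    static final long GET_COST_PER_ROW = 3;

    // Returns true when scanning every row in the covered key range is
    // estimated to cost less than issuing numGets batched Gets.
    static boolean scanLikelyCheaper(long numGets, long rowsInKeyRange) {
        long getCost = numGets * GET_COST_PER_ROW; // cost in "scanned row" units
        long scanCost = rowsInKeyRange;            // one unit per scanned row
        return scanCost < getCost;
    }

    public static void main(String[] args) {
        // 10000 random Gets against a 10bn-row table: Gets should win.
        System.out.println(scanLikelyCheaper(10_000L, 10_000_000_000L));
        // The same 10000 Gets confined to a hypothetical 20000-row key range.
        System.out.println(scanLikelyCheaper(10_000L, 20_000L));
    }
}
```

Under this model a scan only pays off once the Gets touch more than about a third of the rows in the range, and the estimate says nothing about the uncached case, where random Gets are reported to be far slower.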
