Imho, the easiest thing to do would be to write a filter.
You need to order the rows; then you can use a seek hint to jump to the next
wanted row (SEEK_NEXT_USING_HINT).
The main drawback I see is that the filter will be invoked on all region
servers, including the ones that don't need it. But this would also mean
you have a very specific query pattern (which could be the case, I just
don't know), and you can still use the startRow / stopRow of the scan, and
create multiple scans if necessary. I'm also interested in Lars' opinion on
this.
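
To make this concrete, here is a rough, untested sketch of such a filter
against the 0.94-era API (MultiRowFilter is a name I'm making up, and the
Writable plumbing is the bare minimum needed to ship the filter to the
region servers):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical filter: include only the given row keys and hand the
// scanner seek hints so it can jump over everything in between.
public class MultiRowFilter extends FilterBase {
  // Must be built with Bytes.BYTES_COMPARATOR so byte[] keys compare correctly.
  private final SortedSet<byte[]> wantedRows =
      new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);

  public MultiRowFilter() {}  // required no-arg constructor for deserialization

  public MultiRowFilter(SortedSet<byte[]> rows) {
    wantedRows.addAll(rows);
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    if (wantedRows.contains(kv.getRow())) {
      return ReturnCode.INCLUDE;               // a row we want
    }
    return ReturnCode.SEEK_NEXT_USING_HINT;    // skip ahead
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue kv) {
    // Seek to the first wanted row after the current one.
    SortedSet<byte[]> tail = wantedRows.tailSet(kv.getRow());
    return tail.isEmpty() ? null : KeyValue.createFirstOnRow(tail.first());
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(wantedRows.size());
    for (byte[] row : wantedRows) Bytes.writeByteArray(out, row);
  }

  public void readFields(DataInput in) throws IOException {
    wantedRows.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) wantedRows.add(Bytes.readByteArray(in));
  }
}

On the client you would then set startRow/stopRow to the first and last
wanted keys and call scan.setFilter(new MultiRowFilter(rows)).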

Nicolas



On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma <va...@pinterest.com> wrote:

> I have another question: if I am running a scan wrapped around multiple
> rows in the same region, in the following way:
>
> Scan scan = new Scan(getWithMultipleRowsInSameRegion);
>
> Now, how does execution occur? Is it just a sequential scan across the
> entire region, or does it seek to the HFile blocks containing the actual
> values? What I truly mean is, let's say the multi get is on the following
> rows:
>
> Row1 : HFileBlock1
> Row2 : HFileBlock20
> Row3 : Does not exist
> Row4 : HFileBlock25
> Row5 : HFileBlock100
>
> The efficient way to do this would be to determine the correct blocks
> using the index and then search within those blocks for, say, Row1. Then
> seek to HFileBlock20 and look for Row2. Eliminate Row3 and then keep
> seeking to + searching within HFileBlocks as needed.
>
> I am wondering if a scan wrapped around a Get with multiple rows would do
> the same?
>
> Thanks
> Varun
>
> On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon <nkey...@gmail.com>
> wrote:
>
> > Looking at the code, it seems possible to do this server side within the
> > multi invocation: we could group the gets by region and do a single scan.
> > We could also add some heuristics if necessary...
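> >
> > Client side, the same grouping could look roughly like this (an untested
> > sketch; "table" is an HTable and "gets" the list of Gets):
> >
> > // Rough sketch: bucket the Gets by region, so each bucket can become
> > // one bounded scan instead of many point Gets.
> > Map<HRegionInfo, List<Get>> byRegion = new HashMap<HRegionInfo, List<Get>>();
> > for (Get get : gets) {
> >   HRegionInfo region =
> >       table.getRegionLocation(get.getRow()).getRegionInfo();
> >   List<Get> bucket = byRegion.get(region);
> >   if (bucket == null) {
> >     bucket = new ArrayList<Get>();
> >     byRegion.put(region, bucket);
> >   }
> >   bucket.add(get);
> > }
> > // Each bucket then becomes one scan from its smallest to its largest
> > // row key, with a filter selecting only the wanted rows.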
> >
> >
> >
> > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <la...@apache.org> wrote:
> >
> > > I should qualify that statement, actually.
> > >
> > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > returned.
> > >
> > > As James Taylor pointed out to me privately: A fairer comparison would
> > > have been to run a scan with a filter that lets x% of the rows pass
> > > (i.e. the selectivity of the scan would be x%) and compare that to a
> > > multi Get of the same x% of the rows.
> > >
> > > There we found that a Scan+Filter is more efficient than issuing multi
> > > Gets if x is >= 1-2%.
> > >
> > >
> > > Or in other words, translating many Gets into a Scan+Filter is
> > > beneficial if the Scan would return at least 1-2% of the rows to the
> > > client. For example:
> > > If you are looking for fewer than 10-20k rows in 1m rows, using multi
> > > Gets is likely more efficient.
> > > If you are looking for more than 10-20k rows in 1m rows, using a
> > > Scan+Filter is likely more efficient.
> > >
> > >
> > > Of course this is predicated on whether you have an efficient way to
> > > represent the rows you are looking for in a filter, so that would
> > > probably shift this slightly more towards Gets (just imagine a Filter
> > > that has to encode 100k random row keys to be matched; since Filters
> > > are instantiated per store, there is another natural limit there).
> > >
> > >
> > > As I said below, the crux of the matter is having some histograms of
> > > your data, so that such a decision could be made automatically.
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > >  From: lars hofhansl <la...@apache.org>
> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Sent: Monday, February 18, 2013 5:48 PM
> > > Subject: Re: Optimizing Multi Gets in hbase
> > >
> > > As it happens, we did some tests around this last week.
> > > Turns out doing Gets in batches instead of a scan still gives you 1/3
> > > of the performance.
> > >
> > > I.e. when you have a table with, say, 10m rows and scanning takes N
> > > seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is
> > > pretty impressive.
> > >
> > > Now, this is with all data in the cache!
> > > When the data is not in the cache and the Gets are random it is many
> > > orders of magnitude slower, as the Gets are sprayed all over the disk.
> > > In that case sorting the Gets and issuing scans would indeed be much
> > > more efficient.
> > >
> > >
> > > The Gets in a batch are already sorted on the client, but as N. says,
> > > it is hard to determine automatically when to turn many Gets into a
> > > Scan with filters. Without statistics/histograms I'd even wager a guess
> > > that it would be impossible to do.
> > > Imagine you issue 10000 random Gets, but your table has 10bn rows; in
> > > that case it is almost certain that the Gets are faster than a scan.
> > > Now imagine the Gets only cover a small key range. With statistics we
> > > could tell whether it would be beneficial to turn this into a scan.
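> > >
> > > Just to sketch the shape of that decision (everything here is made up
> > > except the ~1-2% crossover from above; estimatedRowsBetween stands in
> > > for the missing statistics):
> > >
> > > // Hypothetical heuristic: with row-count statistics we could choose
> > > // between a multi Get and a Scan+Filter by estimated selectivity.
> > > boolean shouldUseScan(List<Get> sortedGets) {
> > >   byte[] first = sortedGets.get(0).getRow();
> > >   byte[] last = sortedGets.get(sortedGets.size() - 1).getRow();
> > >   long rowsInRange = estimatedRowsBetween(first, last); // needs stats
> > >   // ~1-2% selectivity was the measured crossover point.
> > >   return (double) sortedGets.size() / rowsInRange >= 0.01;
> > > }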
> > >
> > > It's not that hard to add statistics to HBase. I would do it as part
> > > of the compactions and record the histograms in some table.
> > >
> > >
> > > You can always do that yourself. If you suspect you are touching most
> > > rows in a table/region, just issue a scan with an appropriate filter
> > > (you may have to implement your own filter, though). Maybe we could
> > > add a version of RowFilter that matches against multiple keys.
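> > >
> > > In the meantime you could approximate that with the existing filters,
> > > e.g. (untested sketch; wantedRows, firstWantedRow and lastWantedRow
> > > are placeholders):
> > >
> > > // OR together one RowFilter per wanted key. Correct, but with no seek
> > > // hints the filter list is evaluated for every row in the range.
> > > FilterList anyOf = new FilterList(FilterList.Operator.MUST_PASS_ONE);
> > > for (byte[] row : wantedRows) {
> > >   anyOf.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
> > >       new BinaryComparator(row)));
> > > }
> > > Scan scan = new Scan(firstWantedRow,
> > >     Bytes.add(lastWantedRow, new byte[1])); // stopRow is exclusive
> > > scan.setFilter(anyOf);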
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ________________________________
> > > From: Varun Sharma <va...@pinterest.com>
> > > To: user@hbase.apache.org
> > > Sent: Monday, February 18, 2013 1:57 AM
> > > Subject: Optimizing Multi Gets in hbase
> > >
> > > Hi,
> > >
> > > I am trying to do batched get(s) on a cluster. Here is the code:
> > >
> > > List<Get> gets = ...
> > > // Prepare my gets with the rows i need
> > > myHTable.get(gets);
> > >
> > > I have two questions about the above scenario:
> > > i) Is this the most optimal way to do this?
> > > ii) I have a feeling that if there are multiple gets in this case on
> > > the same region, then each one of those will instantiate a separate
> > > scan over the region, even though a single scan is sufficient. Am I
> > > mistaken here?
> > >
> > > Thanks
> > > Varun
> > >
> >
>
