Also, an advantage of going only to the servers needed is the famous MTTR:
there is less chance of hitting a dead server or a region that has just
moved.
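For concreteness, here is a minimal sketch of the kind of hinting filter
being discussed below, written against the 0.94-era Filter API. The class
RowSetFilter and all of its details are illustrative assumptions, not an
existing HBase filter:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical filter: matches a sorted list of wanted row keys and skips
// everything in between by hinting the scanner forward
// (SEEK_NEXT_USING_HINT), i.e. a skip scan.
public class RowSetFilter extends FilterBase {
  private byte[][] rows;    // wanted row keys, sorted ascending
  private int next = 0;     // index of the next wanted row that could match
  private boolean done = false;

  public RowSetFilter() {}  // required for deserialization

  public RowSetFilter(byte[][] sortedRows) {
    this.rows = sortedRows;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    // advance past wanted rows that are behind the scanner's position
    while (next < rows.length
        && Bytes.compareTo(rows[next], 0, rows[next].length,
            kv.getBuffer(), kv.getRowOffset(), kv.getRowLength()) < 0) {
      next++;
    }
    if (next == rows.length) {          // no wanted rows left
      done = true;
      return ReturnCode.NEXT_ROW;
    }
    if (Bytes.compareTo(rows[next], 0, rows[next].length,
        kv.getBuffer(), kv.getRowOffset(), kv.getRowLength()) == 0) {
      return ReturnCode.INCLUDE;        // this is a wanted row
    }
    return ReturnCode.SEEK_NEXT_USING_HINT;  // jump to the next wanted row
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue currentKV) {
    return KeyValue.createFirstOnRow(rows[next]);
  }

  @Override
  public boolean filterAllRemaining() {
    return done;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(rows.length);
    for (byte[] r : rows) Bytes.writeByteArray(out, r);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    rows = new byte[in.readInt()][];
    for (int i = 0; i < rows.length; i++) rows[i] = Bytes.readByteArray(in);
  }
}

The Writable methods are what ship the whole row key list to every region
server, which is exactly the per-store instantiation cost Lars raises
below.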
On Tue, Feb 19, 2013 at 7:42 PM, Nicolas Liochon <[email protected]> wrote:

> Interesting: in the client we're already doing a group-by-location on the
> multiget. So we could have the filter as HBase core code, and then we
> could use it in the client for the multiget: compared to my initial
> proposal, we don't have to change anything in the server code and we
> reuse the filtering framework. The filter can also be used independently.
>
> Is there any issue with this? The reseek seems to be quite smart in the
> way it handles the bloom filters; I don't know if it behaves differently
> in this case vs. a simple get.
>
>
> On Tue, Feb 19, 2013 at 7:27 PM, lars hofhansl <[email protected]> wrote:
>
>> I was thinking along the same lines: doing a skip scan via filter
>> hinting. The problem is, as you say, that the Filter is instantiated
>> everywhere and it might be of significant size (it has to maintain all
>> the row keys you are looking for).
>>
>> RegionScanner now has a reseek method, so it is possible to do this via
>> a coprocessor. Coprocessors are also loaded per region (but at least
>> not for each store), and one can use the shared coproc state I added to
>> alleviate the memory concern.
>>
>> Thinking about this in terms of multiple scans is interesting. One
>> could identify clusters of close row keys in the Gets and issue a Scan
>> for each cluster.
>>
>> -- Lars
>>
>> ________________________________
>> From: Nicolas Liochon <[email protected]>
>> To: user <[email protected]>
>> Sent: Tuesday, February 19, 2013 9:28 AM
>> Subject: Re: Optimizing Multi Gets in hbase
>>
>> Imho, the easiest thing to do would be to write a filter.
>> You need to order the rows; then you can use hints to navigate to the
>> next row (SEEK_NEXT_USING_HINT).
>> The main drawback I see is that the filter will be invoked on all region
>> servers, including the ones that don't need it. But that would also
>> mean you have a very specific query pattern (which could be the case, I
>> just don't know), and you can still use the startRow / stopRow of the
>> scan and create multiple scans if necessary. I'm also interested in
>> Lars' opinion on this.
>>
>> Nicolas
>>
>> On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma <[email protected]> wrote:
>>
>> > I have another question: if I am running a scan wrapped around
>> > multiple rows in the same region, in the following way:
>> >
>> > Scan scan = new Scan(getWithMultipleRowsInSameRegion);
>> >
>> > how does execution occur? Is it just a sequential scan across the
>> > entire region, or does it seek to the HFile blocks containing the
>> > actual values? What I truly mean is, let's say the multi get is on
>> > the following rows:
>> >
>> > Row1 : HFileBlock1
>> > Row2 : HFileBlock20
>> > Row3 : Does not exist
>> > Row4 : HFileBlock25
>> > Row5 : HFileBlock100
>> >
>> > The efficient way to do this would be to determine the correct blocks
>> > using the index and then to search within them: find Row1, then seek
>> > to HFileBlock20 and look for Row2, eliminate Row3, and keep seeking
>> > to and searching within HFileBlocks as needed.
>> >
>> > I am wondering if a scan wrapped around a Get with multiple rows
>> > would do the same?
>> >
>> > Thanks
>> > Varun
>> >
>> > On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon <[email protected]>
>> > wrote:
>> >
>> > > Looking at the code, it seems possible to do this server side
>> > > within the multi invocation: we could group the gets by region and
>> > > do a single scan. We could also add some heuristics if necessary...
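A rough client-side sketch of that per-region grouping, reusing the
hypothetical RowSetFilter above (0.94-era HTable API; the class and method
names are made up for illustration, each Get is assumed to want the whole
row, and error handling is omitted):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetScans {
  // Group the Gets by region start key, then issue one hinted Scan per
  // group. Note: results come back in row key order, not input order.
  public static List<Result> scanPerRegion(HTable table, List<Get> gets)
      throws IOException {
    Map<byte[], List<byte[]>> byRegion =
        new TreeMap<byte[], List<byte[]>>(Bytes.BYTES_COMPARATOR);
    for (Get g : gets) {
      byte[] regionStart =
          table.getRegionLocation(g.getRow()).getRegionInfo().getStartKey();
      List<byte[]> rows = byRegion.get(regionStart);
      if (rows == null) {
        byRegion.put(regionStart, rows = new ArrayList<byte[]>());
      }
      rows.add(g.getRow());
    }
    List<Result> out = new ArrayList<Result>();
    for (List<byte[]> rows : byRegion.values()) {
      Collections.sort(rows, Bytes.BYTES_COMPARATOR);
      byte[] first = rows.get(0);
      byte[] last = rows.get(rows.size() - 1);
      // stopRow is exclusive, so tack a zero byte onto the last wanted row
      Scan scan = new Scan(first, Bytes.add(last, new byte[] {0}));
      scan.setFilter(new RowSetFilter(rows.toArray(new byte[rows.size()][])));
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) out.add(r);
      } finally {
        scanner.close();
      }
    }
    return out;
  }
}

Bounding each Scan with startRow/stopRow keeps the filter off the region
servers that hold none of the wanted rows, which addresses the drawback
Nicolas mentions above.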
>> > > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <[email protected]>
>> > > wrote:
>> > >
>> > > > I should qualify that statement, actually.
>> > > >
>> > > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs
>> > > > are returned.
>> > > >
>> > > > As James Taylor pointed out to me privately: a fairer comparison
>> > > > would have been to run a scan with a filter that lets x% of the
>> > > > rows pass (i.e. the selectivity of the scan would be x%) and
>> > > > compare that to a multi Get of the same x% of the rows.
>> > > >
>> > > > There we found that a Scan+Filter is more efficient than issuing
>> > > > multi Gets if x is >= 1-2%.
>> > > >
>> > > > Or in other words, translating many Gets into a Scan+Filter is
>> > > > beneficial if the Scan would return at least 1-2% of the rows to
>> > > > the client. For example:
>> > > > if you are looking for fewer than 10-20k rows in 1m rows, using
>> > > > multi Gets is likely more efficient;
>> > > > if you are looking for more than 10-20k rows in 1m rows, using a
>> > > > Scan+Filter is likely more efficient.
>> > > >
>> > > > Of course this is predicated on whether you have an efficient way
>> > > > to represent the rows you are looking for in a filter, so that
>> > > > would probably shift this slightly more towards Gets (just
>> > > > imagine a Filter that has to encode 100k random row keys to be
>> > > > matched; since Filters are instantiated per store, there is
>> > > > another natural limit there).
>> > > >
>> > > > As I said below, the crux of the matter is having some histograms
>> > > > of your data, so that such a decision could be made
>> > > > automatically.
>> > > >
>> > > > -- Lars
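That rule of thumb is easy to encode as a client-side decision helper; the
threshold below is just the 1% lower bound Lars gives, not a measured
constant, and both counts are estimates the caller has to supply since
HBase itself keeps no such statistics:

public final class GetVsScanHeuristic {
  // rowsWanted: rows the client needs from the key range;
  // rowsInRange: estimated rows the equivalent Scan would cover.
  public static boolean preferScanWithFilter(long rowsWanted,
      long rowsInRange) {
    return rowsWanted * 100 >= rowsInRange;  // selectivity >= ~1%
  }
}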
>> > > > ________________________________
>> > > > From: lars hofhansl <[email protected]>
>> > > > To: "[email protected]" <[email protected]>
>> > > > Sent: Monday, February 18, 2013 5:48 PM
>> > > > Subject: Re: Optimizing Multi Gets in hbase
>> > > >
>> > > > As it happens we did some tests around this last week.
>> > > > Turns out doing Gets in batches instead of a scan still gives you
>> > > > 1/3 of the performance.
>> > > >
>> > > > I.e. when you have a table with, say, 10m rows and scanning takes
>> > > > N seconds, then calling 10m Gets in batches of 1000 takes ~3N,
>> > > > which is pretty impressive.
>> > > >
>> > > > Now, this is with all data in the cache!
>> > > > When the data is not in the cache and the Gets are random, it is
>> > > > many orders of magnitude slower, as the Gets are sprayed all over
>> > > > the disk. In that case sorting the Gets and issuing scans would
>> > > > indeed be much more efficient.
>> > > >
>> > > > The Gets in a batch are already sorted on the client, but as N.
>> > > > says, it is hard to determine automatically when to turn many
>> > > > Gets into a Scan with filters. Without statistics/histograms I'd
>> > > > even wager a guess that it would be impossible to do.
>> > > > Imagine you issue 10000 random Gets, but your table has 10bn
>> > > > rows; in that case it is almost certain that the Gets are faster
>> > > > than a scan. Now imagine the Gets only cover a small key range.
>> > > > With statistics we could tell whether it would be beneficial to
>> > > > turn this into a scan.
>> > > >
>> > > > It's not that hard to add statistics to HBase. I would do it as
>> > > > part of the compactions and record the histograms in some table.
>> > > >
>> > > > You can always do that yourself. If you suspect you are touching
>> > > > most rows in a table/region, just issue a scan with an
>> > > > appropriate filter (you may have to implement your own filter,
>> > > > though). Maybe we could add a version of RowFilter that matches
>> > > > against multiple keys.
>> > > >
>> > > > -- Lars
>> > > >
>> > > > ________________________________
>> > > > From: Varun Sharma <[email protected]>
>> > > > To: [email protected]
>> > > > Sent: Monday, February 18, 2013 1:57 AM
>> > > > Subject: Optimizing Multi Gets in hbase
>> > > >
>> > > > Hi,
>> > > >
>> > > > I am trying to do batched get(s) on a cluster. Here is the code:
>> > > >
>> > > > List<Get> gets = ...
>> > > > // Prepare my gets with the rows I need
>> > > > myHTable.get(gets);
>> > > >
>> > > > I have two questions about the above scenario:
>> > > > i) Is this the most optimal way to do this?
>> > > > ii) I have a feeling that if there are multiple gets in this
>> > > > case, on the same region, then each one of them will instantiate
>> > > > a separate scan over the region, even though a single scan is
>> > > > sufficient. Am I mistaken here?
>> > > >
>> > > > Thanks
>> > > > Varun
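For reference, a self-contained version of that snippet against the
0.94-era client API; the table name is a placeholder, and the row keys are
taken from the command line for the sake of the example:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // placeholder table name
    List<Get> gets = new ArrayList<Get>();
    for (String key : args) {                    // row keys to fetch
      gets.add(new Get(Bytes.toBytes(key)));
    }
    // The client sorts the Gets and groups them by region server, as
    // discussed in the thread; each Get still runs as its own lookup on
    // the server side.
    Result[] results = table.get(gets);
    System.out.println("fetched " + results.length + " results");
    table.close();
  }
}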
