I am actually more concerned about multiple Gets within a region. I think that if random rows within a region are accessed, it should always be a single scan rather than one scan per Get (just like we do for the BulkDeleteEndpoint). Wouldn't that always be faster?
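
To make the "one scan per region" idea concrete, here is a minimal sketch of the client-side bookkeeping it implies: bucket the requested row keys by owning region, then derive one key range per region that a single scan could cover. It is plain Java and does not call HBase. The class and method names are hypothetical, and it assumes String keys in place of HBase's byte[] keys and that the region start keys are already known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase: groups requested rows per region.
final class RegionGetGrouper {

    /** Maps each region start key to the sorted set of requested rows it holds. */
    static Map<String, TreeSet<String>> groupByRegion(
            List<String> regionStartKeys, List<String> rowKeys) {
        // Assumption: the first region starts at the empty key "".
        TreeSet<String> starts = new TreeSet<>(regionStartKeys);
        Map<String, TreeSet<String>> grouped = new TreeMap<>();
        for (String row : rowKeys) {
            // The owning region has the greatest start key <= row.
            String region = starts.floor(row);
            if (region == null) {
                throw new IllegalArgumentException("No region covers row: " + row);
            }
            grouped.computeIfAbsent(region, k -> new TreeSet<>()).add(row);
        }
        return grouped;
    }

    /** One {firstRow, lastRow} pair per region: the span a single scan must cover. */
    static List<String[]> scanRanges(Map<String, TreeSet<String>> grouped) {
        List<String[]> ranges = new ArrayList<>();
        for (TreeSet<String> rows : grouped.values()) {
            ranges.add(new String[] {rows.first(), rows.last()});
        }
        return ranges;
    }
}
```

Each range would still need a filter that keeps only the requested rows, and whether the resulting scan beats individual Gets depends on how many unrequested rows fall inside the range.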
On Mon, Feb 18, 2013 at 5:48 PM, lars hofhansl <[email protected]> wrote:

> As it happens we did some tests around this last week.
> Turns out doing Gets in batches instead of a scan still gives you 1/3 of
> the performance.
>
> I.e. when you have a table with, say, 10m rows and scanning takes N
> seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is pretty
> impressive.
>
> Now, this is with all data in the cache!
> When the data is not in the cache and the Gets are random it is many
> orders of magnitude slower, as the Gets are sprayed all over the disk. In
> that case sorting the Gets and issuing scans would indeed be much more
> efficient.
>
>
> The Gets in a batch are already sorted on the client, but as N. says it is
> hard to determine when to turn many Gets into a Scan with filters
> automatically. Without statistics/histograms I'd even wager a guess that
> it would be impossible to do.
> Imagine you issue 10000 random Gets, but your table has 10bn rows; in that
> case it is almost certain that the Gets are faster than a scan.
> Now imagine the Gets only cover a small key range. With statistics we could
> tell whether it would be beneficial to turn this into a scan.
>
> It's not that hard to add statistics to HBase. Would do it as part of the
> compactions, and record the histograms in some table.
>
>
> You can always do that yourself. If you suspect you are touching most rows
> in a table/region, just issue a scan with an appropriate filter (may have to
> implement your own filter, though). Maybe we could add a version of RowFilter
> that matches against multiple keys.
>
>
> -- Lars
>
>
>
> ________________________________
> From: Varun Sharma <[email protected]>
> To: [email protected]
> Sent: Monday, February 18, 2013 1:57 AM
> Subject: Optimizing Multi Gets in hbase
>
> Hi,
>
> I am trying to do batched get(s) on a cluster. Here is the code:
>
> List<Get> gets = ...
> // Prepare my gets with the rows i need
> myHTable.get(gets);
>
> I have two questions about the above scenario:
> i) Is this the most optimal way to do this?
> ii) I have a feeling that if there are multiple gets in this case, on the
> same region, then each one of those shall instantiate separate scan(s) over
> the region even though a single scan is sufficient. Am I mistaken here?
>
> Thanks
> Varun
>
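
The trade-off Lars describes can be put as a back-of-envelope rule. If a batched Get costs roughly 3x a scanned row (his cached-data measurement), then a scan over a key range only pays off when the range holds fewer than about 3 rows per requested Get. The sketch below encodes just that comparison. The class name, method name, and constant are hypothetical, the row-count estimate is exactly the statistic the thread says HBase does not currently keep, and the ratio would be very different for uncached data.

```java
// Hypothetical estimate, not an HBase API: compares rough costs of Gets vs. one scan.
final class GetVsScanEstimate {

    // Assumption taken from the quoted test: with all data cached, one batched
    // Get costs about as much as scanning 3 rows.
    static final double GET_TO_SCAN_COST_RATIO = 3.0;

    /**
     * Returns true when scanning the whole key range is estimated to be cheaper
     * than issuing numGets individual Gets. estimatedRowsInRange must come from
     * some external statistic or guess; filter cost and disk seeks are ignored.
     */
    static boolean scanLikelyCheaper(long numGets, long estimatedRowsInRange) {
        return numGets * GET_TO_SCAN_COST_RATIO > estimatedRowsInRange;
    }
}
```

With the numbers from the thread, 10000 Gets against a 10bn-row range favor Gets, while the same 10000 Gets confined to a range of 20000 rows favor a scan.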
