Re: Improving Batchscanner Performance

Josh Elser Tue, 20 May 2014 10:32:47 -0700

Hi David,

Absolutely. What you have here is a classic producer-consumer model. Your BatchScanner is producing results, which you then consume with your Scanner, and ultimately return those results to the client.

The problem with your implementation below is that you're not going to be polling your BatchScanner as aggressively as you could be. You are blocking while you fetch each of those new Ranges from the Scanner before fetching new ranges. Have you considered splitting up the BatchScanner and Scanner code into two different threads?

You could easily use an ArrayBlockingQueue (or similar) to pass results from the BatchScanner to the Scanner. I would imagine that this would give you a fair improvement in performance.
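
As a rough illustration of that split, here is a minimal sketch in plain Java. It deliberately does not touch the Accumulo API: `indexResults` stands in for iterating the BatchScanner, `lookup` stands in for the per-row Scanner work, and the class name `ScanPipeline`, the `DONE` sentinel, and the queue capacity are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

class ScanPipeline {
    // Sentinel marking the end of the producer's output (compared by identity).
    private static final String DONE = new String("__DONE__");

    /**
     * Drains indexResults on a producer thread while the calling thread
     * consumes each row ID and applies lookup to it.
     * Assumption: indexResults stands in for a BatchScanner, lookup for the
     * second (Scanner) query.
     */
    static List<String> run(Iterable<String> indexResults, int capacity,
                            Function<String, String> lookup) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                try {
                    for (String rowId : indexResults) {
                        queue.put(rowId); // blocks while the queue is full
                    }
                } finally {
                    queue.put(DONE); // always tell the consumer we are finished
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<String> results = new ArrayList<>();
        while (true) {
            String rowId = queue.take(); // blocks while the queue is empty
            if (rowId == DONE) {
                break;
            }
            results.add(lookup.apply(rowId));
        }
        producer.join();
        return results;
    }
}
```

Because the queue is bounded, the producer stalls once `capacity` row IDs are waiting, so the index scan can run ahead of the lookups without piling up an unbounded backlog.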

Also, it doesn't appear that there's a reason you can't use a BatchScanner for both lookups?

One final warning: your current implementation could also hog heap very badly if your BatchScanner returns too many records. The producer/consumer I proposed should help here a little bit, but you should still be asserting upper bounds to avoid running out of heap space in your client.
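
One possible way to assert such an upper bound, sketched here under assumptions: split the row IDs into fixed-size chunks and run one lookup per chunk, so that only one chunk's worth of ranges and results needs to be held at a time. The `BoundedBatches` class and its `partition` helper are hypothetical names, and the chunk size is something you would have to tune for your own heap.

```java
import java.util.ArrayList;
import java.util.List;

class BoundedBatches {
    /** Splits items into consecutive chunks of at most maxSize elements. */
    static <T> List<List<T>> partition(List<T> items, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxSize) {
            int end = Math.min(start + maxSize, items.size());
            // Copy the sublist so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```

Each chunk could then be turned into its own set of Ranges and handed to the scanner, with the results passed on before the next chunk is started.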


On 5/20/14, 1:10 PM, Slater, David M. wrote:

Hey everyone,

I'm trying to improve the query performance of batchscans on my data table. I 
first scan over index tables, which returns a set of rowIDs that correspond to 
the records I am interested in. This set of records is fairly randomly (and 
uniformly) distributed across a large number of tablets, due to the randomness 
of the UID and the query itself. Then I want to scan over my data table, which 
is set up as follows:

row       colFam    colQual   value
rowUID    --        --        byte[] of data

These records are fairly small (100s of bytes), but numerous (I may return 
50000 or more). The method I use to obtain this follows. Essentially, I turn 
the rows returned from the first query into a set of ranges to input into the 
batchscanner, and then return those rows, retrieving the value from them.

    // Returns the data associated with the given collection of rows.
    public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
            String tablename, int queryThreads) throws TableNotFoundException {
        List<byte[]> values = new ArrayList<byte[]>(rows.size());
        if (!rows.isEmpty()) {
            BatchScanner scanner = conn.createBatchScanner(tablename,
                    new Authorizations(), queryThreads);
            try {
                // One single-row range per row ID returned by the index scan.
                List<Range> ranges = new ArrayList<Range>(rows.size());
                for (Text row : rows) {
                    ranges.add(new Range(row));
                }
                scanner.setRanges(ranges);
                for (Map.Entry<Key, Value> entry : scanner) {
                    values.add(entry.getValue().get());
                }
            } finally {
                // Release the scanner's resources even if the scan fails.
                scanner.close();
            }
        }
        return values;
    }

Is there a more efficient way to do this? I have index caches and bloom filters 
enabled (data caches are not), but I still seem to have a long query lag. Any 
thoughts on how I can improve this?

Thanks,
David
