Hi Josh,

I should have clarified - I am using a batchscanner for both lookups. I had 
thought of putting it into two different threads, but the first scan is 
typically an order of magnitude faster than the second.

The logic for upperbounding the results returned is outside of the method I 
provided. Since there is a one-to-one relationship between rowIDs and records 
on the second scan, I just limit the number of rows I send to this method. 

As for blocking, I'm not sure exactly what you mean. I complete the first scan 
in its entirety before entering this method with the collection of Text rowIDs. 
The method for that is:

public Collection<Text> getRowIDs(Collection<Range> ranges, Text term, String 
tablename, int queryThreads, int limit) throws TableNotFoundException {
        Set<Text> guids = new HashSet<Text>();
        if (!ranges.isEmpty()) {
            BatchScanner scanner = conn.createBatchScanner(tablename, new 
Authorizations(), queryThreads);
            try {
                scanner.setRanges(ranges);
                scanner.fetchColumnFamily(term);
                for (Map.Entry<Key, Value> entry : scanner) {
                    guids.add(entry.getKey().getColumnQualifier());
                    if (guids.size() > limit) {
                        return null; // over the result cap; bail out early
                    }
                }
            } finally {
                scanner.close(); // release scanner resources on every exit path
            }
        }
        return guids;
    }

Essentially, my query does:
Collection<Text> rows = getRowIDs(Collections.singletonList(new Range("minRow", 
"maxRow")), new Text("index"), "mytable", 10, 10000);
Collection<byte[]> data = getRowData(rows, "mytable", 10);


-----Original Message-----
From: Josh Elser [mailto:[email protected]] 
Sent: Tuesday, May 20, 2014 1:32 PM
To: [email protected]
Subject: Re: Improving Batchscanner Performance

Hi David,

Absolutely. What you have here is a classic producer-consumer model. 
Your BatchScanner produces results, which you then consume with your Scanner, 
ultimately returning those results to the client.

The problem with your implementation below is that you're not going to be 
polling your BatchScanner as aggressively as you could be. You block while you 
fetch each of those new Ranges from the Scanner before fetching more results 
from the BatchScanner. Have you considered splitting the BatchScanner and 
Scanner code into two different threads?

You could easily use an ArrayBlockingQueue (or similar) to pass results from the 
BatchScanner to the Scanner. I would imagine this would give you a fair 
improvement in performance.

Also, it doesn't appear that there's any reason you can't use a BatchScanner for 
both lookups?

One final warning: your current implementation could also hog heap very badly 
if your BatchScanner returns too many records. The producer/consumer I proposed 
should help a little here, but you should still assert upper bounds to avoid 
running out of heap space in your client.
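The threading and bounded-queue suggestions above could be sketched roughly as 
follows. This is a minimal, self-contained sketch: the class name 
PipelinedLookup is hypothetical, plain Strings stand in for the Text rowIDs, 
and the producer/consumer bodies mark where the real index scan and data-table 
lookup would go. The bounded ArrayBlockingQueue applies backpressure to the 
producer, which also caps heap usage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the producer-consumer split. The producer thread
// stands in for the index BatchScanner; the consumer stands in for the
// data-table lookup. Strings stand in for the Accumulo Text rowIDs.
public class PipelinedLookup {
    // Poison pill: signals the consumer that the index scan is finished.
    private static final String DONE = "\u0000DONE";

    public static List<String> pipeline(List<String> indexResults, int capacity)
            throws InterruptedException {
        // Bounded queue: producer blocks when it is full, so the index scan
        // cannot run arbitrarily far ahead of the data-table lookup.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<String> data = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                // Stand-in for iterating the index BatchScanner's entries.
                for (String rowId : indexResults) {
                    queue.put(rowId); // blocks when the queue is full
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: in the real code this would turn rowIDs into Ranges and
        // feed them to the second BatchScanner as they arrive.
        String rowId;
        while (!(rowId = queue.take()).equals(DONE)) {
            data.add("data-for-" + rowId); // stand-in for the value lookup
        }
        producer.join();
        return data;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> values = pipeline(List.of("row1", "row2", "row3"), 2);
        System.out.println(values);
    }
}
```

The same shape works if both sides use a BatchScanner: the consumer drains the 
queue in batches, builds a List of Ranges, and issues one data-table lookup per 
batch instead of per row.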

On 5/20/14, 1:10 PM, Slater, David M. wrote:
> Hey everyone,
>
> I'm trying to improve the query performance of batchscans on my data table. I 
> first scan over index tables, which returns a set of rowIDs that correspond 
> to the records I am interested in. This set of records is fairly randomly 
> (and uniformly) distributed across a large number of tablets, due to the 
> randomness of the UID and the query itself. Then I want to scan over my data 
> table, which is setup as follows:
> row       colFam    colQual    value
> rowUID    --        --         byte[] of data
>
> These records are fairly small (100s of bytes), but numerous (I may return 
> 50000 or more). The method I use to obtain this follows. Essentially, I turn 
> the rows returned from the first query into a set of ranges to input into the 
> batchscanner, and then return those rows, retrieving the value from them.
>
> // returns the data associated with the given collection of rows
>      public Collection<byte[]> getRowData(Collection<Text> rows, Text 
> dataType, String tablename, int queryThreads) throws TableNotFoundException {
>          List<byte[]> values = new ArrayList<byte[]>(rows.size());
>          if (!rows.isEmpty()) {
>              BatchScanner scanner = conn.createBatchScanner(tablename, new 
> Authorizations(), queryThreads);
>              List<Range> ranges = new ArrayList<Range>();
>              for (Text row : rows) {
>                  ranges.add(new Range(row));
>              }
>              scanner.setRanges(ranges);
>              for (Map.Entry<Key, Value> entry : scanner) {
>                  values.add(entry.getValue().get());
>              }
>              scanner.close();
>          }
>          return values;
>      }
>
> Is there a more efficient way to do this? I have index caches and bloom 
> filters enabled (data caches are not), but I still seem to have a long query 
> lag. Any thoughts on how I can improve this?
>
> Thanks,
> David
>
