Josh,

The data is not significantly larger than the rows that I'm fetching. In terms 
of bandwidth, the data returned is at least 2 orders of magnitude smaller than 
the ingest rate, so I don't think it's a network issue.

I'm guessing, as Bob suggested, that it has to do with fetching a "random" set 
of rows each time. I had assumed that the BatchScanner would take the 
Collection of ranges (passed to batchScanner.setRanges()), sort them, and 
then fetch data based on tablet splits. I'm guessing, based on the discussion, 
that it is not done that way. 
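
The assumption above can be sketched as follows. This is plain Java with 
made-up split points and row IDs; RangeGroupingSketch, groupByTablet, and 
LAST_TABLET are hypothetical names, and nothing here is a claim about how the 
BatchScanner actually bins ranges:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Sketch only: no Accumulo classes are used; rows and split points are plain strings.
final class RangeGroupingSketch {
    // Hypothetical label for the tablet that follows the last split point.
    static final String LAST_TABLET = "<last>";

    // Sorts the row IDs, then groups them by the end row of the tablet that would
    // hold them, assuming a tablet covers (previous split, split].
    static Map<String, List<String>> groupByTablet(Collection<String> rows,
                                                   Collection<String> splitPoints) {
        TreeSet<String> splits = new TreeSet<>(splitPoints);
        List<String> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);

        // Insertion order follows the sorted rows, so tablets appear in key order.
        Map<String, List<String>> byTablet = new LinkedHashMap<>();
        for (String row : sorted) {
            String endRow = splits.ceiling(row); // smallest split >= row, or null
            String tablet = (endRow == null) ? LAST_TABLET : endRow;
            byTablet.computeIfAbsent(tablet, k -> new ArrayList<>()).add(row);
        }
        return byTablet;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("q7", "a3", "m1", "c9", "z2", "g0");
        List<String> splits = Arrays.asList("g", "p");
        System.out.println(groupByTablet(rows, splits));
    }
}
```

With this grouping, each tablet would be contacted once with all of its rows, 
regardless of the order in which the rows were supplied.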

Does the BatchScanner fetch rows based on the ordering of the Collection?

Thanks,
David

-----Original Message-----
From: Josh Elser [mailto:[email protected]] 
Sent: Tuesday, May 20, 2014 1:59 PM
To: [email protected]
Subject: Re: Improving Batchscanner Performance

You actually stated it exactly here:

 > I complete the first scan in its entirety

Loading the data into a Collection also implies that you're loading the 
complete set of rows and blocking until you find all rows, or until you fetch 
all of the data.

 > Collection<Text> rows = getRowIDs(new Range("minRow", "maxRow"), new 
 > Text("index"), "mytable", 10, 10000);
 > Collection<byte[]> data = getRowData(rows, "mytable", 10);

Both the BatchScanner and Scanner are returning KeyValue pairs in "batches". 
The client talks to server(s), reads some data and returns it to you. By virtue 
of you loading these results from the Iterator into a Collection, you are 
consuming *all* results before proceeding to fetch the data for the rows.
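
One way to avoid consuming everything up front is to overlap the two stages. 
What follows is a minimal sketch in plain Java, not Accumulo API: an 
Iterator<String> stands in for the first scan, and the returned batches stand 
in for the sets of row IDs that would be handed to the second lookup. 
PipelinedLookupSketch, pipeline, and the DONE sentinel are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class PipelinedLookupSketch {
    // Unique sentinel object, compared by identity so it cannot collide with a real row ID.
    private static final String DONE = new String("DONE");

    // Drains rowIds on a producer thread while the calling thread groups them into
    // batches. The bounded queue caps how many IDs sit in memory at once.
    static List<List<String>> pipeline(Iterator<String> rowIds, int queueCapacity, int batchSize) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        Thread producer = new Thread(() -> {
            try {
                while (rowIds.hasNext()) {
                    queue.put(rowIds.next()); // blocks while the queue is full
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();

        List<List<String>> batches = new ArrayList<>();
        List<String> batch = new ArrayList<>(batchSize);
        try {
            for (String id = queue.take(); id != DONE; id = queue.take()) {
                batch.add(id);
                if (batch.size() == batchSize) {
                    // In a real client, this is where the second lookup would be issued.
                    batches.add(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming row IDs", e);
        }
        if (!batch.isEmpty()) {
            batches.add(batch); // final partial batch
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        System.out.println(pipeline(ids.iterator(), 2, 2));
    }
}
```

The queue capacity and batch size are arbitrary here; they trade client heap 
against how often the consumer has to wait on the producer.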

Now, if, like you said, looking up the rows is drastically faster than fetching 
the data, there's a question as to why this is. Is it safe to assume that the 
data is much larger than the rows you're fetching? Have you tried to see what 
the throughput of fetching this data is? If it's bounded by network speed, you 
could try compressing the data in an iterator server-side before returning it 
to the client.
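
A rough way to answer the throughput question is to time the loop that 
consumes the values and divide the bytes seen by the elapsed time. The sketch 
below is plain Java with an Iterator<byte[]> standing in for the scanner's 
values; ThroughputSketch, totalBytes, and bytesPerSecond are hypothetical 
names, and the sample sizes are made up.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ThroughputSketch {
    // Sums the sizes of all values produced by the iterator.
    static long totalBytes(Iterator<byte[]> values) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next().length;
        }
        return total;
    }

    // Converts a byte count and an elapsed time in nanoseconds into bytes per second.
    static double bytesPerSecond(long bytes, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return bytes / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        // Stand-in for scanner results: three values of a few hundred bytes each.
        List<byte[]> values = Arrays.asList(new byte[300], new byte[450], new byte[250]);
        long start = System.nanoTime();
        long bytes = totalBytes(values.iterator());
        long elapsed = Math.max(1, System.nanoTime() - start);
        System.out.printf("%d bytes, %.0f bytes/s%n", bytes, bytesPerSecond(bytes, elapsed));
    }
}
```

Comparing that figure against the available network bandwidth indicates 
whether compression is likely to help at all.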

You could also consider the locality of the rows that you're fetching -- are 
you fetching a "random" set of rows each time and paying a penalty of talking 
to each server to fetch the data, when you could amortize the cost if you 
fetched the data for rows that are close together? A large amount of data being 
returned is likely going to trump the additional cost in talking to many 
servers.


On 5/20/14, 1:51 PM, Slater, David M. wrote:
> Hi Josh,
>
> I should have clarified - I am using a batchscanner for both lookups. I had 
> thought of putting it into two different threads, but the first scan is 
> typically an order of magnitude faster than the second.
>
> The logic for upperbounding the results returned is outside of the method I 
> provided. Since there is a one-to-one relationship between rowIDs and records 
> on the second scan, I just limit the number of rows I send to this method.
>
> As for blocking, I'm not sure exactly what you mean. I complete the first 
> scan in its entirety before entering this method with the collection 
> of Text rowIDs. The method for that is:
>
> // Returns the rowIDs found for the term, or null if more than `limit` are found.
> public Collection<Text> getRowIDs(Collection<Range> ranges, Text term, String 
> tablename, int queryThreads, int limit) throws TableNotFoundException {
>          Set<Text> guids = new HashSet<Text>();
>          if (!ranges.isEmpty()) {
>              BatchScanner scanner = conn.createBatchScanner(tablename, new 
> Authorizations(), queryThreads);
>              try {
>                  scanner.setRanges(ranges);
>                  scanner.fetchColumnFamily(term);
>                  for (Map.Entry<Key, Value> entry : scanner) {
>                      guids.add(entry.getKey().getColumnQualifier());
>                      if (guids.size() > limit) {
>                          return null;
>                      }
>                  }
>              } finally {
>                  // Close on every path, including the early return above.
>                  scanner.close();
>              }
>          }
>          return guids;
>      }
>
> Essentially, my query does:
> Collection<Text> rows = getRowIDs(new Range("minRow", "maxRow"), new 
> Text("index"), "mytable", 10, 10000);
> Collection<byte[]> data = getRowData(rows, "mytable", 10);
>
>
> -----Original Message-----
> From: Josh Elser [mailto:[email protected]]
> Sent: Tuesday, May 20, 2014 1:32 PM
> To: [email protected]
> Subject: Re: Improving Batchscanner Performance
>
> Hi David,
>
> Absolutely. What you have here is a classic producer-consumer model.
> Your BatchScanner is producing results, which you then consume by your 
> scanner, and ultimately return those results to the client.
>
> The problem with your below implementation is that you're not going to be 
> polling your batchscanner as aggressively as you could be. You are blocking 
> while you can fetch each of those new Ranges from the Scanner before fetching 
> new ranges. Have you considered splitting up the BatchScanner and Scanner 
> code into two different threads?
>
> You could easily use an ArrayBlockingQueue (or similar) to pass results from 
> the BatchScanner to the Scanner. I would imagine that this would give you a 
> fair improvement in performance.
>
> Also, it doesn't appear that there's a reason you can't use a BatchScanner 
> for both lookups?
>
> One final warning: your current implementation could also hog heap very badly 
> if your BatchScanner returns too many records. The producer/consumer I 
> proposed should help here a little bit, but you should still be asserting 
> upper-bounds to avoid running out of heap space in your client.
>
> On 5/20/14, 1:10 PM, Slater, David M. wrote:
>> Hey everyone,
>>
>> I'm trying to improve the query performance of batchscans on my data table. 
>> I first scan over index tables, which returns a set of rowIDs that 
>> correspond to the records I am interested in. This set of records is fairly 
>> randomly (and uniformly) distributed across a large number of tablets, due 
>> to the randomness of the UID and the query itself. Then I want to scan over 
>> my data table, which is set up as follows:
>> row           colFam        colQual       value
>> rowUID        --            --            byte[] of data
>>
>> These records are fairly small (100s of bytes), but numerous (I may return 
>> 50000 or more). The method I use to obtain this follows. Essentially, I turn 
>> the rows returned from the first query into a set of ranges to input into 
>> the batchscanner, and then return those rows, retrieving the value from them.
>>
>> // returns the data associated with the given collection of rows
>>       public Collection<byte[]> getRowData(Collection<Text> rows, Text 
>> dataType, String tablename, int queryThreads) throws TableNotFoundException {
>>           List<byte[]> values = new ArrayList<byte[]>(rows.size());
>>           if (!rows.isEmpty()) {
>>               BatchScanner scanner = conn.createBatchScanner(tablename, new 
>> Authorizations(), queryThreads);
>>               List<Range> ranges = new ArrayList<Range>();
>>               for (Text row : rows) {
>>                   ranges.add(new Range(row));
>>               }
>>               scanner.setRanges(ranges);
>>               for (Map.Entry<Key, Value> entry : scanner) {
>>                   values.add(entry.getValue().get());
>>               }
>>               scanner.close();
>>           }
>>           return values;
>>       }
>>
>> Is there a more efficient way to do this? I have index caches and bloom 
>> filters enabled (data caches are not), but I still seem to have a long query 
>> lag. Any thoughts on how I can improve this?
>>
>> Thanks,
>> David
>>
