You actually stated it exactly here:
> I complete the first scan in its entirety
Loading the data into a Collection also implies that you're loading the
complete set of rows and blocking until you find all rows, or until you
fetch all of the data.
> Collection<Text> rows = getRowIDs(new Range("minRow", "maxRow"), new Text("index"), "mytable", 10, 10000);
> Collection<byte[]> data = getRowData(rows, "mytable", 10);
Both the BatchScanner and Scanner are returning KeyValue pairs in
"batches". The client talks to server(s), reads some data and returns it
to you. By virtue of you loading these results from the Iterator into a
Collection, you are consuming *all* results before proceeding to fetch
the data for the rows.
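To make that concrete, here is a minimal sketch, assuming the index scan can be treated as a plain Iterator of row IDs. It hands the row IDs to the data lookup in fixed-size chunks as they arrive instead of collecting all of them first. ChunkedLookup, forEachChunk and the chunk size are hypothetical names for illustration, not part of the Accumulo API; in real code the iterator would come from the index BatchScanner and the callback would run the data lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Accumulo: drains an iterator in
// fixed-size chunks so the second lookup can start before the first
// scan has finished.
final class ChunkedLookup {

    // Calls `lookup` once per chunk of at most `chunkSize` row IDs and
    // returns the number of chunks handed off.
    static <T> int forEachChunk(Iterator<T> rowIds, int chunkSize,
                                Consumer<List<T>> lookup) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        List<T> chunk = new ArrayList<>(chunkSize);
        while (rowIds.hasNext()) {
            chunk.add(rowIds.next());
            if (chunk.size() == chunkSize) {
                lookup.accept(chunk);
                chunks++;
                // Start a fresh list; the callee may keep the old one.
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) {
            // Hand off the final, partially filled chunk.
            lookup.accept(chunk);
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Stand-in for the index scan: five row IDs.
        Iterator<String> ids =
            Arrays.asList("r1", "r2", "r3", "r4", "r5").iterator();
        // Stand-in for the data lookup: just print each chunk.
        forEachChunk(ids, 2, chunk -> System.out.println("lookup " + chunk));
    }
}
```

This only bounds how many row IDs are held at once; the two lookups still run one after the other on the same thread.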
Now, if, like you said, looking up the rows is drastically faster than
fetching the data, there's a question as to why this is. Is it safe to
assume that the data is much larger than the rows you're fetching? Have
you tried to see what the throughput of fetching this data is? If it's
bounded by network speed, you could try compressing the data in an
iterator server-side before returning it to the client.
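For the throughput question, a rough client-side check could look like the sketch below. It only counts the value bytes the client sees and divides by elapsed time, so it ignores keys and RPC overhead. ThroughputCheck and its methods are hypothetical names, and the iterator of byte[] is a stand-in for the values coming back from the data scan.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper for a rough client-side throughput check.
final class ThroughputCheck {

    // Converts a byte count and an elapsed time in nanoseconds to MB/s
    // (1 MB = 1,000,000 bytes here).
    static double megabytesPerSecond(long bytes, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        double seconds = elapsedNanos / 1_000_000_000.0;
        return (bytes / 1_000_000.0) / seconds;
    }

    // Drains the iterator and returns the total number of value bytes seen.
    static long countBytes(Iterator<byte[]> values) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next().length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for the values returned by the data scan.
        Iterator<byte[]> values =
            Arrays.asList(new byte[300], new byte[500]).iterator();
        long start = System.nanoTime();
        long bytes = countBytes(values);
        // Guard against a zero reading on very fast runs.
        long elapsed = Math.max(1, System.nanoTime() - start);
        System.out.printf("%d bytes, %.3f MB/s%n",
            bytes, megabytesPerSecond(bytes, elapsed));
    }
}
```

If the number you get is well below what the network link can carry, the network is probably not the bottleneck and compression is unlikely to help much.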
You could also consider the locality of the rows that you're fetching:
are you fetching a "random" set of rows each time and paying a penalty
of talking to each server to fetch the data, when you could amortize the
cost if you fetched the data for rows that are close together? A large
amount of data being returned is likely going to trump the additional
cost of talking to many servers.
On 5/20/14, 1:51 PM, Slater, David M. wrote:
Hi Josh,
I should have clarified - I am using a batchscanner for both lookups. I had
thought of putting it into two different threads, but the first scan is
typically an order of magnitude faster than the second.
The logic for upperbounding the results returned is outside of the method I
provided. Since there is a one-to-one relationship between rowIDs and records
on the second scan, I just limit the number of rows I send to this method.
As for blocking, I'm not sure exactly what you mean. I complete the first scan
in its entirety before entering this method with the collection of Text
rowIDs. The method for that is:
public Collection<Text> getRowIDs(Collection<Range> ranges, Text term, String tablename,
    int queryThreads, int limit) throws TableNotFoundException {
  Set<Text> guids = new HashSet<Text>();
  if (!ranges.isEmpty()) {
    BatchScanner scanner = conn.createBatchScanner(tablename, new Authorizations(), queryThreads);
    try {
      scanner.setRanges(ranges);
      scanner.fetchColumnFamily(term);
      for (Map.Entry<Key, Value> entry : scanner) {
        guids.add(entry.getKey().getColumnQualifier());
        if (guids.size() > limit) {
          return null; // more than `limit` matches
        }
      }
    } finally {
      scanner.close(); // also runs on the early return above
    }
  }
  return guids;
}
Essentially, my query does:
Collection<Text> rows = getRowIDs(new Range("minRow", "maxRow"), new Text("index"), "mytable", 10, 10000);
Collection<byte[]> data = getRowData(rows, "mytable", 10);
-----Original Message-----
From: Josh Elser [mailto:[email protected]]
Sent: Tuesday, May 20, 2014 1:32 PM
To: [email protected]
Subject: Re: Improving Batchscanner Performance
Hi David,
Absolutely. What you have here is a classic producer-consumer model.
Your BatchScanner is producing results, which your Scanner then consumes,
and ultimately those results are returned to the client.
The problem with your implementation below is that you're not going to be
polling your BatchScanner as aggressively as you could be. You are blocking
while you fetch each of those new Ranges from the Scanner before fetching
new ranges from the BatchScanner. Have you considered splitting up the
BatchScanner and Scanner code into two different threads?
You could easily use an ArrayBlockingQueue (or similar) to pass results from the
BatchScanner to the Scanner. I would imagine that this would give you a fair
improvement in performance.
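As a rough sketch of what I mean, with no Accumulo classes involved: one thread drains the source iterator (standing in for the BatchScanner over the index) into a bounded ArrayBlockingQueue, and the calling thread takes items off the queue and runs the second lookup on each. QueuePipeline, run and the DONE sentinel are hypothetical names; a real version would probably hand over batches of row IDs instead of single items, and would need to handle a source that throws, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical producer/consumer sketch; not part of Accumulo.
final class QueuePipeline {

    // Sentinel marking the end of the stream; compared by identity.
    private static final Object DONE = new Object();

    // Feeds `source` through a bounded queue and applies `lookup` to each
    // item on the calling thread. The source must not yield null, since
    // ArrayBlockingQueue rejects null elements.
    static <I, O> List<O> run(Iterator<I> source, Function<I, O> lookup,
                              int capacity) {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(source.next()); // blocks while the queue is full
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();

        List<O> results = new ArrayList<>();
        try {
            for (Object item = queue.take(); item != DONE; item = queue.take()) {
                @SuppressWarnings("unchecked")
                I rowId = (I) item;
                results.add(lookup.apply(rowId));
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } finally {
            // Unblocks the producer if the consumer stopped early.
            producer.interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-ins: row IDs from the index scan, and a "lookup" that
        // just measures the length of each ID.
        Iterator<String> ids = Arrays.asList("a", "bb", "ccc").iterator();
        System.out.println(run(ids, String::length, 2));
    }
}
```

The queue capacity is what bounds memory here: once the queue is full, the producer waits until the consumer catches up.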
Also, it doesn't appear that there's a reason you can't use a BatchScanner for
both lookups?
One final warning: your current implementation could also hog heap very badly
if your BatchScanner returns too many records. The producer/consumer I proposed
should help here a little bit, but you should still be asserting upper bounds
to avoid running out of heap space in your client.
On 5/20/14, 1:10 PM, Slater, David M. wrote:
Hey everyone,
I'm trying to improve the query performance of batchscans on my data table. I
first scan over index tables, which returns a set of rowIDs that correspond to
the records I am interested in. This set of records is fairly randomly (and
uniformly) distributed across a large number of tablets, due to the randomness
of the UID and the query itself. Then I want to scan over my data table, which
is set up as follows:
row      colFam   colQual   value
rowUID   --       --        byte[] of data
These records are fairly small (100s of bytes), but numerous (I may return
50000 or more). The method I use to obtain them follows. Essentially, I turn
the rows returned from the first query into a set of ranges to input into the
batchscanner, and then return those rows, retrieving the value from them.
// returns the data associated with the given collection of rows
public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType, String tablename,
    int queryThreads) throws TableNotFoundException {
  List<byte[]> values = new ArrayList<byte[]>(rows.size());
  if (!rows.isEmpty()) {
    BatchScanner scanner = conn.createBatchScanner(tablename, new Authorizations(), queryThreads);
    List<Range> ranges = new ArrayList<Range>();
    for (Text row : rows) {
      ranges.add(new Range(row));
    }
    scanner.setRanges(ranges);
    for (Map.Entry<Key, Value> entry : scanner) {
      values.add(entry.getValue().get());
    }
    scanner.close();
  }
  return values;
}
Is there a more efficient way to do this? I have index caches and bloom filters
enabled (data caches are not), but I still seem to have a long query lag. Any
thoughts on how I can improve this?
Thanks,
David