Hey everyone,
I'm trying to improve the query performance of batch scans on my data table. I
first scan over index tables (roughly sketched below), which return a set of
rowIDs corresponding to the records I am interested in. This set of records is
fairly randomly (and uniformly) distributed across a large number of tablets,
due to the randomness of the UIDs and of the query itself. Then I want to scan
over my data table, which is set up as follows:
row      colFam   colQual   value
rowUID   --       --        byte[] of data
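For context, the index phase that produces those rowIDs looks roughly like
this; the index table name, the searchTerm variable, and the assumption that
the matching rowUID sits in the column qualifier are simplifications of my
actual schema:

// sketch of the index lookup that produces the rowIDs (simplified)
Scanner indexScanner = conn.createScanner("indexTable", new Authorizations());
indexScanner.setRange(new Range(new Text(searchTerm)));
Set<Text> rowIDs = new HashSet<Text>();
for (Map.Entry<Key, Value> entry : indexScanner) {
    // the data-table rowUID is stored in the column qualifier of each index entry
    rowIDs.add(entry.getKey().getColumnQualifier());
}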
These records are fairly small (hundreds of bytes each) but numerous (a single
query may return 50,000 or more). The method I use is below: essentially, I
turn the rowIDs returned from the first query into a set of single-row ranges,
hand those to a BatchScanner, and collect the value of every entry it returns.
// returns the data associated with the given collection of rows
public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
        String tablename, int queryThreads) throws TableNotFoundException {
    List<byte[]> values = new ArrayList<byte[]>(rows.size());
    if (!rows.isEmpty()) {
        BatchScanner scanner = conn.createBatchScanner(tablename,
                new Authorizations(), queryThreads);
        try {
            // one single-row range per rowID returned by the index query
            List<Range> ranges = new ArrayList<Range>(rows.size());
            for (Text row : rows) {
                ranges.add(new Range(row));
            }
            scanner.setRanges(ranges);
            // collect the value of every entry the batch scanner returns
            for (Map.Entry<Key, Value> entry : scanner) {
                values.add(entry.getValue().get());
            }
        } finally {
            scanner.close();
        }
    }
    return values;
}
Is there a more efficient way to do this? I have index caches and bloom filters
enabled (data caches are not), but I still seem to have a long query lag. Any
thoughts on how I can improve this?
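In case it's relevant, here is roughly how those settings were applied (the
property names are the standard per-table configuration options; the
data/block cache line is commented out since I left it at its default):

// enable bloom filters and the index cache on the data table
conn.tableOperations().setProperty(tablename, "table.bloom.enabled", "true");
conn.tableOperations().setProperty(tablename, "table.cache.index.enable", "true");
// data (block) cache is left disabled:
// conn.tableOperations().setProperty(tablename, "table.cache.block.enable", "true");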
Thanks,
David