David, I use this same pattern and have for the last couple of years. What I have put in place is a cache (redis) between the batch scanner thread that is reading the index tables and a separate consumer thread that does the final lookup of the rowIDs. The rowID pool can grow as it needs to in redis, then the consumer thread pulls whatever IDs it needs to do the final scan to get my row data. Tuning the batch size for the scanner to a small number that matches the consumer's needs has helped us keep the final scanner from blocking on too much data, or from flooding the network with unneeded data between the server and client(s).
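The decoupling described above can be sketched with a bounded in-process queue standing in for the redis rowID pool (the class and names here are illustrative, not our actual code; in practice the producer would be the index-table BatchScanner and the pool would live in redis so it can grow across processes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RowIdPipeline {
    // Bounded queue stands in for the redis rowID pool; the bound applies
    // back-pressure so the index scanner can't flood the consumer.
    private static final BlockingQueue<String> rowIdPool = new ArrayBlockingQueue<>(1000);
    private static final String POISON = "__done__";

    public static void main(String[] args) throws InterruptedException {
        // Producer: stands in for the index-table batch scanner thread.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5000; i++) {
                    rowIdPool.put("row-" + i);   // blocks when the pool is full
                }
                rowIdPool.put(POISON);           // signal end of index scan
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: drains small batches sized to what the final scan needs.
        // Batches may come up smaller than batchSize if the pool runs dry.
        int batchSize = 100;
        int fetched = 0;
        List<String> batch = new ArrayList<>(batchSize);
        boolean done = false;
        while (!done) {
            batch.clear();
            String first = rowIdPool.take();     // block for at least one rowID
            if (first.equals(POISON)) break;
            batch.add(first);
            while (batch.size() < batchSize) {
                String next = rowIdPool.poll();  // non-blocking top-up
                if (next == null) break;
                if (next.equals(POISON)) { done = true; break; }
                batch.add(next);
            }
            // Here each batch would become Ranges for the data-table
            // BatchScanner; we just count rows for the demonstration.
            fetched += batch.size();
        }
        producer.join();
        System.out.println(fetched);             // prints 5000
    }
}
```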
I would recommend you reconsider the use of UUID for rowID. We found a significant performance improvement by using an organic rowID that fit the data/purpose better, which keeps the data ordered on fewer tablet servers most of the time. Adding more threads to a batch scanner just because we used random rowIDs seemed to have diminishing returns as the rowIDs approached randomness.

HTH

-----Original Message-----
From: Slater, David M. [mailto:[email protected]]
Sent: Tuesday, May 20, 2014 12:11 PM
To: [email protected]
Subject: Improving Batchscanner Performance

Hey everyone,

I'm trying to improve the query performance of batchscans on my data table. I first scan over index tables, which returns a set of rowIDs that correspond to the records I am interested in. This set of records is fairly randomly (and uniformly) distributed across a large number of tablets, due to the randomness of the UID and the query itself. Then I want to scan over my data table, which is set up as follows:

row       colFam  colQual  value
rowUID    --      --       byte[] of data

These records are fairly small (100s of bytes) but numerous (I may return 50000 or more). The method I use to obtain them follows. Essentially, I turn the rows returned from the first query into a set of ranges to input into the batch scanner, then iterate over the results, retrieving the value from each entry.
// returns the data associated with the given collection of rows
public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
        String tablename, int queryThreads) throws TableNotFoundException {
    List<byte[]> values = new ArrayList<byte[]>(rows.size());
    if (!rows.isEmpty()) {
        BatchScanner scanner = conn.createBatchScanner(tablename,
                new Authorizations(), queryThreads);
        List<Range> ranges = new ArrayList<Range>();
        for (Text row : rows) {
            ranges.add(new Range(row));
        }
        scanner.setRanges(ranges);
        for (Map.Entry<Key, Value> entry : scanner) {
            values.add(entry.getValue().get());
        }
        scanner.close();
    }
    return values;
}

Is there a more efficient way to do this? I have index caches and bloom filters enabled (data caches are not), but I still seem to have a long query lag. Any thoughts on how I can improve this?

Thanks,
David
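One mitigation in the spirit of the batch-size tuning described in the reply above is to hand setRanges() only a slice of the 50000+ ranges at a time, so a single scan never has the whole result set in flight. A generic sketch of the slicing step (the chunk helper is hypothetical, not an Accumulo API; with Accumulo it would be applied to the List<Range> before each setRanges call):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeChunker {
    // Splits a list into consecutive sub-lists of at most `size` elements.
    // Each sub-list would be one setRanges(...) call, keeping only a bounded
    // amount of data in flight between the tablet servers and the client.
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(new ArrayList<>(
                    items.subList(i, Math.min(i + size, items.size()))));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) rows.add(i);
        List<List<Integer>> chunks = chunk(rows, 3);
        System.out.println(chunks.size());   // prints 4
        System.out.println(chunks.get(3));   // prints [9]
    }
}
```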
