David

I use this same pattern and have for the last couple of years.  What I have put 
in place is a Redis cache between the BatchScanner thread that reads the index 
tables and a separate consumer thread that does the final lookup of the rowIDs.  
The rowID pool can grow as it needs to in Redis, and the consumer thread pulls 
whatever IDs it needs for the final scan that retrieves my row data.  Tuning the 
scanner's batch size down to a small number that matches the consumer's needs 
has helped us keep the final scanner from blocking on too much data, and from 
flooding the network between the server and client(s) with data we don't need.
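
Roughly, the hand-off between the two threads looks like this (a minimal 
sketch assuming the Jedis client; the "rowid-pool" key, the batch size, and 
the method shapes are illustrative, not our actual production code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import redis.clients.jedis.Jedis;

    // Producer thread: push each rowID from the index scan into the Redis pool
    void produce(BatchScanner indexScanner, Jedis jedis) {
        for (Map.Entry<Key, Value> entry : indexScanner) {
            jedis.rpush("rowid-pool", entry.getKey().getRow().toString());
        }
    }

    // Consumer thread: pull a small batch of rowIDs and run the final lookup
    void consume(BatchScanner dataScanner, Jedis jedis, int batchSize) {
        List<Range> ranges = new ArrayList<Range>();
        String rowId;
        while (ranges.size() < batchSize && (rowId = jedis.lpop("rowid-pool")) != null) {
            ranges.add(new Range(rowId));
        }
        if (ranges.isEmpty()) {
            return;
        }
        dataScanner.setRanges(ranges);
        List<byte[]> values = new ArrayList<byte[]>();
        for (Map.Entry<Key, Value> entry : dataScanner) {
            values.add(entry.getValue().get());
        }
        // hand 'values' off to whatever downstream processing needs the row data
    }

Keeping the consumer's batch small is what stops the final scan from either 
stalling on a huge range list or pulling back more data than it can use.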

I would also recommend you reconsider the use of UUIDs for rowIDs.  We found a 
significant performance improvement by switching to an organic rowID that fits 
the data/purpose better and keeps the data ordered on fewer tablet servers most 
of the time.  Throwing more threads at the batch scanner just because the 
rowIDs are random seems to give diminishing returns as the rowIDs approach 
full randomness.
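
For example, something along these lines (the field names are made up just to 
show the shape of the idea, not what our schema actually looks like):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    // rowID built from a natural key plus a zero-padded timestamp, e.g.
    // "sensor42_0000000001400612345678" -- rows for the same source sort
    // together, so a typical query touches far fewer tablet servers than
    // randomly assigned UUIDs would
    void writeRecord(BatchWriter writer, String sourceId, long timestampMillis,
            byte[] payloadBytes) throws MutationsRejectedException {
        String rowId = String.format("%s_%019d", sourceId, timestampMillis);
        Mutation m = new Mutation(new Text(rowId));
        m.put(new Text("data"), new Text(""), new Value(payloadBytes));
        writer.addMutation(m);
    }

The trade-off is the usual one: a rowID that follows the data can hot-spot a 
single tablet on ingest, so it only pays off when your queries naturally group 
around that key.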

HTH

-----Original Message-----
From: Slater, David M. [mailto:[email protected]] 
Sent: Tuesday, May 20, 2014 12:11 PM
To: [email protected]
Subject: Improving Batchscanner Performance

Hey everyone,

I'm trying to improve the query performance of batch scans on my data table. I 
first scan over index tables, which returns a set of rowIDs corresponding to 
the records I am interested in. This set of records is fairly randomly (and 
uniformly) distributed across a large number of tablets, due to the randomness 
of the UID and of the query itself. Then I want to scan over my data table, 
which is set up as follows:
row        colFam    colQual    value
rowUID     --        --         byte[] of data

These records are fairly small (hundreds of bytes) but numerous (a query may 
return 50,000 or more). The method I use to retrieve them follows. Essentially, 
I turn the rows returned by the first query into a set of ranges for the 
batch scanner, then iterate over the scanner and collect the value from each 
returned entry.

// returns the data associated with the given collection of rows
    public Collection<byte[]> getRowData(Collection<Text> rows, Text dataType,
            String tablename, int queryThreads) throws TableNotFoundException {
        List<byte[]> values = new ArrayList<byte[]>(rows.size());
        if (!rows.isEmpty()) {
            BatchScanner scanner = conn.createBatchScanner(tablename,
                    new Authorizations(), queryThreads);
            // one single-row Range per rowID returned by the index query
            List<Range> ranges = new ArrayList<Range>();
            for (Text row : rows) {
                ranges.add(new Range(row));
            }
            scanner.setRanges(ranges);
            // collect the value of every entry the batch scanner returns
            for (Map.Entry<Key, Value> entry : scanner) {
                values.add(entry.getValue().get());
            }
            scanner.close();
        }
        return values;
    }

Is there a more efficient way to do this? I have index caches and bloom filters 
enabled (data caches are not), but I still seem to have a long query lag. Any 
thoughts on how I can improve this?

Thanks,
David
