First, thanks to everyone for their responses to my previous questions. (Mike, 
I'll definitely take a look at Brian's materials for iterator behavior.)

Now I'm doing some sharded document querying (where the documents are small but 
numerous). I want not just the list of matching documents but also their 
contents (which are also stored in Accumulo). However, I'm running into a 
bottleneck in the retrieval step: the BatchScanner seems quite slow when given 
a very large number of small ranges (one per document), and increasing the 
thread count doesn't seem to help.

Basically, I'm taking each docID returned from the index lookup, making a new 
Range(docID), adding it to a Collection<Range>, and then passing that 
collection to a new BatchScanner to retrieve the documents:

...
Collection<Range> docRanges = new LinkedList<Range>();
for (Map.Entry<Key, Value> entry : indexScanner) { // walk the index table
    Text docID = entry.getKey().getColumnQualifier();
    docRanges.add(new Range(docID));
}

int threadCount = 20;
String docTableName = "docTable";
BatchScanner docScanner = connector.createBatchScanner(docTableName,
        new Authorizations(), threadCount);
docScanner.setRanges(docRanges); // large collection of ranges

for (Map.Entry<Key, Value> doc : docScanner) { // retrieve doc data
    ...
}
docScanner.close(); // release the scanner's threads when done
...

Is this a naïve way of doing this? Would trying to group documents into larger 
ranges (when adjacent) be a more viable approach?
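On the grouping idea: if I recall correctly, Accumulo's Range.mergeOverlapping 
only collapses ranges that actually overlap, so single-docID point lookups 
won't merge on their own. A rough sketch of coalescing them myself, under the 
(hypothetical) assumption that docIDs are zero-padded numeric strings so that 
lexicographic order matches numeric order, might look like this:

```java
import java.util.*;

// Sketch: coalesce runs of consecutive docIDs into [start, end] spans.
// Assumes zero-padded numeric docIDs (an assumption for illustration);
// each span would then become one Range(start, end) for the BatchScanner.
public class RangeCoalescer {
    public static List<String[]> coalesce(SortedSet<String> docIds) {
        List<String[]> spans = new ArrayList<>();
        String start = null, prev = null;
        for (String id : docIds) {
            if (start == null) {
                start = prev = id; // begin the first run
            } else if (Long.parseLong(id) == Long.parseLong(prev) + 1) {
                prev = id; // id is consecutive: extend the current run
            } else {
                spans.add(new String[] { start, prev }); // close the run
                start = prev = id;
            }
        }
        if (start != null) spans.add(new String[] { start, prev });
        return spans;
    }

    public static void main(String[] args) {
        SortedSet<String> ids = new TreeSet<>(
                Arrays.asList("0001", "0002", "0003", "0007", "0008", "0010"));
        for (String[] s : coalesce(ids)) {
            System.out.println(s[0] + ".." + s[1]);
        }
        // prints: 0001..0003, 0007..0008, 0010..0010 (one span per line)
    }
}
```

Fewer, larger ranges should mean less per-range overhead on the tablet 
servers, at the cost of possibly scanning rows between the grouped docIDs if 
the keyspace isn't actually contiguous.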

Thanks,
David
