You also have some other options. One would be using an IteratorChain to string together the results of several BatchScanners in a row, which you could kick off in parallel to batch up your reads.
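For concreteness, here is a rough sketch of that first option, assuming the raw-typed IteratorChain from Apache Commons Collections 3; the class name, chunk size, and method are placeholders I'm making up, not anything from the thread:

import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.commons.collections.iterators.IteratorChain;

public class ChainedDocScan {

    // Split the ranges into chunks, give each chunk its own BatchScanner
    // (each scanner fetches with its own thread pool), and chain the result
    // iterators so the client consumes one continuous stream.
    @SuppressWarnings("unchecked")
    static Iterator<Map.Entry<Key,Value>> scanInChunks(Connector conn, String table,
            List<Range> ranges, int chunkSize, int threads) throws Exception {
        IteratorChain chain = new IteratorChain(); // raw-typed in commons-collections 3.x
        for (int i = 0; i < ranges.size(); i += chunkSize) {
            BatchScanner bs = conn.createBatchScanner(table, new Authorizations(), threads);
            bs.setRanges(ranges.subList(i, Math.min(i + chunkSize, ranges.size())));
            chain.addIterator(bs.iterator()); // each scanner can begin fetching its chunk
        }
        return chain; // in real code, track and close each BatchScanner once exhausted
    }
}

The chunk size bounds how many ranges any single scanner holds at once; a production version would also keep references to the scanners so they can be closed.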
Or, writing this in a sequence model: use the Iterator<Map.Entry<Key,Value>> from the indexScanner to feed an Iterator<Map.Entry<Key,Value>> of your own creation that produces document key/values. As you request document key/values using next(), it prefetches a number of index key/values, runs a batch scan, and queues the results for you. When it runs out of document results, it repeats. This model has been successful for us when hitting a term index to pull millions of source records without loading them all into client memory at the same time. (A sketch of this pattern follows the quoted thread below.)

On Wed, Jan 23, 2013 at 1:51 PM, Keith Turner <[email protected]> wrote:
> How much data is coming back, and what's the data rate? You can sum up
> the size of the keys and values in your loop.
>
> On Wed, Jan 23, 2013 at 1:24 PM, Slater, David M.
> <[email protected]> wrote:
> > First, thanks to everyone for their responses to my previous questions.
> > (Mike, I'll definitely take a look at Brian's materials for iterator
> > behavior.)
> >
> > Now I'm doing some sharded document querying (where the documents are
> > small but numerous), where I'm trying to get not just the list of
> > documents but also to return all of them (they are also stored in
> > Accumulo). However, I'm running into a bottleneck in the retrieval
> > process. It seems that the BatchScanner is quite slow at retrieving
> > information when there is a very large number of (small) ranges
> > (entries, i.e. docs), and increasing the thread count doesn't seem to
> > help.
> >
> > Basically, I'm taking all of the docIDs that are returned from the index
> > process, making a new Range(docID), adding that to Collection<Range>
> > ranges, and then adding those ranges to the new BatchScanner to return
> > the information:
> >
> > …
> > Collection<Range> docRanges = new LinkedList<Range>();
> >
> > for (Map.Entry<Key, Value> entry : indexScanner) { // Go through index table here
> >     Text docID = entry.getKey().getColumnQualifier();
> >     docRanges.add(new Range(docID));
> > }
> >
> > int threadCount = 20;
> > String docTableName = "docTable";
> > BatchScanner docScanner = connector.createBatchScanner(docTableName,
> >     new Authorizations(), threadCount);
> > docScanner.setRanges(docRanges); // large collection of ranges
> >
> > for (Map.Entry<Key, Value> doc : docScanner) { // retrieve doc data
> >     ...
> > }
> > …
> >
> > Is this a naïve way of doing this? Would trying to group documents into
> > larger ranges (when adjacent) be a more viable approach?
> >
> > Thanks,
> >
> > David
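And here is a minimal sketch of the sequence model described above; the class name, batch size, and error handling are placeholders, assuming only the stock Accumulo client API:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Queue;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

// Feeds off the index scanner's iterator: whenever the queue runs dry, it
// pulls the next batchSize docIDs from the index, runs one BatchScanner over
// them, and queues the document entries. Only one batch of documents is held
// in client memory at a time.
public class PrefetchingDocIterator implements Iterator<Map.Entry<Key,Value>> {

    private final Iterator<Map.Entry<Key,Value>> index;
    private final Connector conn;
    private final String docTable;
    private final int batchSize;   // e.g. 10,000 index entries per batch
    private final int threads;
    private final Queue<Map.Entry<Key,Value>> queue = new LinkedList<Map.Entry<Key,Value>>();
    private BatchScanner current;  // scanner backing the current batch

    public PrefetchingDocIterator(Iterator<Map.Entry<Key,Value>> index, Connector conn,
            String docTable, int batchSize, int threads) {
        this.index = index;
        this.conn = conn;
        this.docTable = docTable;
        this.batchSize = batchSize;
        this.threads = threads;
    }

    public boolean hasNext() {
        while (queue.isEmpty() && index.hasNext()) {
            fetchNextBatch();
        }
        if (queue.isEmpty() && current != null) {
            current.close();
            current = null;
        }
        return !queue.isEmpty();
    }

    public Map.Entry<Key,Value> next() {
        if (!hasNext()) throw new NoSuchElementException();
        return queue.poll();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    private void fetchNextBatch() {
        List<Range> ranges = new ArrayList<Range>(batchSize);
        while (index.hasNext() && ranges.size() < batchSize) {
            Text docID = index.next().getKey().getColumnQualifier();
            ranges.add(new Range(docID));
        }
        if (ranges.isEmpty()) return;
        if (current != null) current.close(); // done with the previous batch
        try {
            current = conn.createBatchScanner(docTable, new Authorizations(), threads);
        } catch (TableNotFoundException e) {
            throw new RuntimeException(e);
        }
        current.setRanges(ranges);
        for (Map.Entry<Key,Value> doc : current) {
            queue.add(doc);
        }
    }
}

Consuming it then looks much like the original loop:

Iterator<Map.Entry<Key,Value>> docs =
    new PrefetchingDocIterator(indexScanner.iterator(), connector, "docTable", 10000, 20);
while (docs.hasNext()) {
    Map.Entry<Key,Value> doc = docs.next();
    // retrieve doc data ...
}

--
John Stoneham
[email protected]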
