Hi All, I was wondering if someone would be willing to help evaluate my reasoning on the use of Scanner vs. BatchScanner and see if I'm making the proper assumptions.
The background: I am attempting to benchmark an RDF application on Accumulo by evaluating the impact of scaling on performance (measured by query return time). The scan patterns currently use the Scanner class, and each scan fetches a single row of data. The table design/implementation is such that there is never a need to scan multiple non-adjacent rows simultaneously. One query from the GUI should effectively result in a single, one-time range scan. The size of the returned data varies widely, from as few as 10 results to millions. The return order is not significant.

Reading the API suggests: "If you want to lookup a few ranges and expect those ranges to contain a lot of data, then use the Scanner instead", and that BatchScanner should be reserved for cases where you want to scan multiple ranges simultaneously. It also feels odd to use a BatchScanner on a "collection" of 1 range.

That said, my results so far show that scaling is not adding much: performance peaks at 6 machines and drops beyond that. This contradicts the theoretical linear improvement I should be seeing.

To my understanding, a BatchScanner scans the tservers in parallel and a Scanner does not. Would it be reasonable to expect that using BatchScanner would let me see scaling effects closer to what they should be? My logic is that I have X rows spread across 10 machines. Right now, whether I'm using 1 machine or 10, the scan iterates over all rows one after another. If I used a BatchScanner, would I be guaranteed to reduce the time to result to that of a lookup on 1 machine, instead of the average case of 5 machines or the worst case of 10 (assuming uniform data distribution and various other assumptions)?

I'm mainly questioning this because the API documentation on Scanner/BatchScanner suggests Scanner is the desired choice for my application, but I don't see how switching to BatchScanner, even if I'm not using it as intended, wouldn't improve scaling potential.
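The reasoning above can be put as a rough back-of-the-envelope model. This is only a sketch and does not use the Accumulo API or reflect measured numbers. The class `ScanCostModel`, its methods, and the per-entry cost are all hypothetical. It assumes the data is split uniformly, that the scanned range really spans tablets on several servers, and that a sequential scan costs the sum of the per-server times while a parallel scan costs the slowest server's time. If the range lives on a single tablet, both numbers collapse to the same value.

```java
import java.util.Arrays;

// Hypothetical timing model, not Accumulo code: compares visiting servers
// one after another with visiting them all at once.
class ScanCostModel {

    // Sequential scan: servers are read one after another, so the
    // total time is the sum of the per-server times.
    public static double sequentialMillis(double[] perServerMillis) {
        return Arrays.stream(perServerMillis).sum();
    }

    // Parallel scan: servers are read at the same time, so the
    // total time is that of the slowest server.
    public static double parallelMillis(double[] perServerMillis) {
        return Arrays.stream(perServerMillis).max().orElse(0.0);
    }

    // Assumption: entries are spread uniformly, and each entry costs
    // the same (made-up) amount of time to read.
    public static double[] uniformLoad(long totalEntries, int servers, double millisPerEntry) {
        double[] load = new double[servers];
        Arrays.fill(load, (double) totalEntries / servers * millisPerEntry);
        return load;
    }

    public static void main(String[] args) {
        long totalEntries = 1_000_000L;   // illustrative result size
        double millisPerEntry = 0.001;    // illustrative cost, not measured
        for (int servers : new int[] {1, 6, 10}) {
            double[] load = uniformLoad(totalEntries, servers, millisPerEntry);
            System.out.printf("servers=%d sequential=%.1f ms parallel=%.1f ms%n",
                    servers, sequentialMillis(load), parallelMillis(load));
        }
    }
}
```

Under these assumptions the sequential figure stays flat as servers are added, while the parallel figure shrinks in proportion to the server count, which is the linear improvement I was expecting to see.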
Thanks in advance for thoughts/insight. -Steve
