Replies inline.

John
On Sun, Aug 12, 2012 at 5:49 PM, Steven Troxell <[email protected]> wrote:

> Hi All,
>
> I was wondering if someone would be willing to help evaluate my reasoning
> on the use of Scanner vs. BatchScanner, and see if I'm making the proper
> assumptions.
>
> The background is I am attempting to benchmark an RDF application using
> Accumulo by evaluating the impact of scaling on performance (measured by
> query return time).
>
> The scan patterns currently use the Scanner class, and get a single row
> of data. The table design/implementation is such that there is never a
> need to simultaneously scan multiple non-adjacent rows. One query from
> the GUI should effectively result in a one-time single range scan. The
> size of data returned varies widely, from as small as 10 to, say, millions
> of results. The return order is not significant.
>
> Reading the API suggests: "If you want to lookup a few ranges and
> expect those ranges to contain a lot of data, then use the Scanner
> instead" and the use of BatchScanner should be reserved for cases of
> simultaneously wanting to use multiple ranges. It additionally feels weird
> to be using batchscan on a "collection" of 1 range.

There is a lot of variety in a range. A range can consist of a single row, and therefore touch a single server, or it can span a large amount of data, up to the entire table. In that case, while it may be only 1 range, it hits a lot of data. If the way your data is laid out in Accumulo guarantees that a query hits 1 rowID, then using a Scanner vs. a BatchScanner for that 1 range will make no difference.

> That said, my performance so far shows scaling is not adding much, 6
> machines is the max performance of getting, with drops in performance over
> that amount. This contradicts the theoretical linear improvement I should
> be seeing. To my understanding, BatchScanning scans the Tservers in
> parallel, Scanner does not.
> Would it be reasonable to expect using
> BatchScanner would allow to see the effects of scaling closer to what they
> should be?

If you only pull back a single row, going from a Scanner to a BatchScanner will make no difference. If you are iteratively scanning multiple ranges, however, you could see performance improvements by doing it all in a single BatchScanner, assuming that your scans do not depend on the results of previous scans.

As for the theoretical linear improvement, the underlying assumption there is that you are in some way fully utilizing the resources before scaling up to a larger cluster. If you're simply getting a single row, then whether you're on 2 or 200 machines the performance should be the same, not 100 times better. Scaling out the architecture allows you to do more with it, not necessarily do the same thing faster (although that can be a noticeable effect if you're swamping the systems). But depending on how you're utilizing your data, scaling out too much could also hinder performance (this is the case where you are doing intersections, like the DocumentPartitioned stuff in the wiki example).

> My logic here is that I have X rows spread out across 10 machines. Right
> now whether I'm using 1 machine or 10 machines it is iteratively scanning
> all rows. If I batchscanned would I be guaranteed to minimize the time
> to result of that of a lookup on 1 machine, instead of the average case of
> 5 machines, or worst case of 10 (assuming uniform data distribution and
> various other assumptions).

From everything I gathered about your setup, grabbing a single rowID using a Scanner would show no performance difference from a BatchScanner. If your lookups are not necessarily for a single rowID, there is the potential to get results back faster: if your range spans multiple tablets, the tablets are read in parallel rather than successively, as the Scanner does.
But if you're grabbing X rows, you can grab all X simultaneously instead of iterating over them with Scanners or doing your own threading on the client.

> I'm mainly just questioning this, because the API info on Scan/BatchScan
> suggests Scan is the desired choice for my application, but I don't see how
> switching to Batchscan, even if I'm perhaps not utilizing it as intended,
> wouldn't improve scaling potential.
>
> Thanks in advance for thoughts/insight.
> -Steve
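
To make the "all X simultaneously vs. iteratively" point concrete, here is a rough sketch of what doing your own threading on the client looks like. Note that this is plain Java and not the Accumulo API: the `LookupSketch` class, the in-memory `TABLE` map, and `scanRow` are hypothetical stand-ins I made up for a table and a single-row Scanner lookup, so treat it as an illustration of the access pattern only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class LookupSketch {
    // Hypothetical stand-in for a table: rowID -> values stored in that row.
    static final Map<String, List<String>> TABLE = new ConcurrentHashMap<>();

    // Stand-in for one single-row lookup (what one Scanner over one range would do).
    static List<String> scanRow(String rowId) {
        return TABLE.getOrDefault(rowId, Collections.<String>emptyList());
    }

    // Iterative pattern: each lookup starts only after the previous one finished.
    static List<String> scanIteratively(List<String> rowIds) {
        List<String> results = new ArrayList<>();
        for (String rowId : rowIds) {
            results.addAll(scanRow(rowId));
        }
        return results;
    }

    // Hand-rolled parallel pattern: all lookups are submitted up front and run
    // concurrently on a fixed pool of client threads.
    static List<String> scanInParallel(List<String> rowIds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String rowId : rowIds) {
                Callable<List<String>> task = () -> scanRow(rowId);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so the output matches the iterative version.
            List<String> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        TABLE.put("row1", Arrays.asList("a", "b"));
        TABLE.put("row2", Arrays.asList("c"));
        List<String> rowIds = Arrays.asList("row1", "row2");
        System.out.println(scanIteratively(rowIds).equals(scanInParallel(rowIds, 4)));
    }
}
```

With a single rowID there is only one task to submit, so the two methods do the same work, which is the reason I would not expect a difference in your case. With many independent rowIDs, the second pattern is the one a BatchScanner saves you from writing yourself; unlike this sketch, a BatchScanner does not return entries in sorted order, which you said is fine for you.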
