Do we have a (hopefully reproducible) conclusion from this thread, regarding Scanners and BatchScanners?
On Sep 13, 2016 11:17 PM, "Josh Elser" <[email protected]> wrote:
> Yeah, this seems to have been OS X causing me grief.
>
> Spun up a 3-tserver cluster (on OpenStack, even) and reran the same
> experiment. I could not reproduce the issues, even without substantial
> config tweaking.
>
> Josh Elser wrote:
>
>> I'm playing around with this a little more today and something is
>> definitely weird on my local machine. I'm seeing insane spikes in
>> performance using Scanners too.
>>
>> Coupled with Keith's inability to repro this, I am starting to think
>> that these are not worthwhile numbers to put weight behind. Something
>> I haven't been able to figure out is quite screwy for me.
>>
>> Josh Elser wrote:
>>
>>> Sven, et al:
>>>
>>> So, it would appear that I have been able to reproduce this one
>>> (better late than never, I guess...). tl;dr Serially using Scanners
>>> to do point lookups instead of a BatchScanner is ~20x faster. This
>>> sounds like a pretty serious performance issue to me.
>>>
>>> Here's a general outline of what I did:
>>>
>>> * Accumulo 1.8.0
>>> * Created a table with 1M rows, each row with 10 columns, using YCSB
>>> (workloada)
>>> * Split the table into 9 tablets
>>> * Computed the set of all rows in the table
>>>
>>> For a number of iterations:
>>> * Shuffle this set of rows
>>> * Choose the first N rows
>>> * Construct an equivalent set of Ranges from the set of rows,
>>> choosing a random column (0-9)
>>> * Partition the N rows into X collections
>>> * Submit X tasks to query one partition of the N rows (to a thread
>>> pool with X fixed threads)
>>>
>>> I have two implementations of these tasks: one where all ranges in a
>>> partition are executed via a single BatchScanner, and a second where
>>> each range is executed serially using a Scanner (sketched below).
>>> The numbers speak for themselves.
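A minimal sketch of the two task variants described above, assuming an
Accumulo 1.8 Connector named conn and one partition of Ranges per task;
the numQueryThreads value of 10 is illustrative, and the actual query
code is in the repository linked below:

import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class LookupTasks {

  // Variant 1: all ranges in the partition are handed to one BatchScanner.
  static long batchScannerLookup(Connector conn, String table,
      List<Range> partition) throws TableNotFoundException {
    long entries = 0;
    // 10 query threads is an illustrative choice, not from the thread.
    BatchScanner bs = conn.createBatchScanner(table, Authorizations.EMPTY, 10);
    try {
      bs.setRanges(partition);
      for (Entry<Key,Value> e : bs) {
        entries++; // drain all results
      }
    } finally {
      bs.close();
    }
    return entries;
  }

  // Variant 2: each range in the partition is executed serially with a Scanner.
  static long scannerLookup(Connector conn, String table,
      List<Range> partition) throws TableNotFoundException {
    long entries = 0;
    Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
    for (Range r : partition) {
      scanner.setRange(r); // one point lookup at a time
      for (Entry<Key,Value> e : scanner) {
        entries++;
      }
    }
    return entries;
  }
}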
>>>
>>> ** BatchScanners **
>>> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>>> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>>> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries executed in 40178 ms
>>> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries executed in 42296 ms
>>> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries executed in 46094 ms
>>> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries executed in 47704 ms
>>> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries executed in 49221 ms
>>>
>>> ** Scanners **
>>> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
>>> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
>>> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2833 ms
>>> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2536 ms
>>> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2150 ms
>>> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2061 ms
>>> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
>>> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2140 ms
>>>
>>> Query code is available at
>>> https://github.com/joshelser/accumulo-range-binning
>>>
>>> Sven Hodapp wrote:
>>>
>>>> Hi Keith,
>>>>
>>>> I've tried it with 1, 2, or 10 threads. Unfortunately, there were no
>>>> significant differences.
>>>> Maybe it's a problem with the table structure? For example, it may
>>>> happen that one row ID (e.g. a sentence) has several thousand column
>>>> families. Can this affect seek performance?
>>>>
>>>> My initial example seeks about 3000 row IDs, which return about 500k
>>>> entries. If I filter for specific column families (e.g. a document
>>>> without annotations), only about 5k entries are returned, but the
>>>> seek time is merely halved.
>>>> Are there too many column families to seek quickly?
>>>>
>>>> Thanks!
>>>>
>>>> Regards,
>>>> Sven
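On Sven's column-family question, a minimal sketch of restricting a
Scanner to a single column family with fetchColumnFamily, assuming the
same Connector conn; the row ID and family name are hypothetical:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FamilyFilteredLookup {

  static void lookupOneFamily(Connector conn, String table)
      throws TableNotFoundException {
    Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
    scanner.setRange(Range.exact("someSentenceId")); // hypothetical row ID
    // Only entries from this family are returned to the client; the
    // tserver still has to locate them among the row's other families.
    scanner.fetchColumnFamily(new Text("annotations")); // hypothetical family
    for (Entry<Key,Value> e : scanner) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}

The fetch is applied server-side, so it shrinks what comes back over the
wire, but the tserver still seeks within the wide row, which would be
consistent with the seek time only halving in Sven's test.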
