Josh,

Two questions:

1. How many iterations did you do? I would like to see an absolute number of
   lookups per second to compare against other observations.
2. Can you post your code somewhere so I can run it?

Thanks,
Adam

On Sat, Sep 10, 2016 at 3:01 PM, Josh Elser <[email protected]> wrote:

> Sven, et al:
>
> So, it would appear that I have been able to reproduce this one (better
> late than never, I guess...). tl;dr Serially using Scanners to do point
> lookups instead of a BatchScanner is ~20x faster. This sounds like a pretty
> serious performance issue to me.
>
> Here's a general outline for what I did.
>
> * Accumulo 1.8.0
> * Created a table with 1M rows, each row with 10 columns using YCSB
>   (workloada)
> * Split the table into 9 tablets
> * Computed the set of all rows in the table
>
> For a number of iterations:
> * Shuffle this set of rows
> * Choose the first N rows
> * Construct an equivalent set of Ranges from the set of Rows, choosing a
>   random column (0-9)
> * Partition the N rows into X collections
> * Submit X tasks to query one partition of the N rows (to a thread pool
>   with X fixed threads)
>
> I have two implementations of these tasks. One, where all ranges in a
> partition are executed via one BatchScanner. A second where each range is
> executed in serial using a Scanner. The numbers speak for themselves.
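
For anyone who wants to reason about the harness without a cluster, the
outline above can be sketched roughly as follows. This is not taken from
Josh's repository and makes no Accumulo calls: run_iteration, partition,
query_partition and fake_query are hypothetical names, and fake_query is only
a placeholder for a task that would drive a Scanner or a BatchScanner.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def partition(items, x):
    """Split items into x roughly equal collections (round-robin)."""
    return [items[i::x] for i in range(x)]


def run_iteration(all_rows, n, x, query_partition, rng=random):
    """Shuffle the rows, take the first n, pick a random column (0-9) for
    each, and query x partitions on a pool of x threads.

    Returns (elapsed milliseconds, per-partition results)."""
    rows = list(all_rows)
    rng.shuffle(rows)
    # One point lookup per chosen row, on a random column 0-9.
    lookups = [(row, rng.randint(0, 9)) for row in rows[:n]]
    parts = partition(lookups, x)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=x) as pool:
        # list() forces every task to finish before the timer stops.
        results = list(pool.map(query_partition, parts))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, results


def fake_query(part):
    """Hypothetical stand-in for a Scanner or BatchScanner task.

    It only counts the lookups it was handed; a real task would issue them."""
    return len(part)


# Example: 300 lookups out of 1000 rows, 6 partitions on 6 threads.
elapsed_ms, counts = run_iteration(range(1000), n=300, x=6,
                                   query_partition=fake_query)
print(f"{sum(counts)} lookups in {elapsed_ms:.1f} ms")
```

Swapping fake_query for two real task implementations (one batched, one
serial) is what would reproduce the comparison; everything else stays the
same between the two runs.
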
>
> ** BatchScanners **
> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries executed in 40178 ms
> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries executed in 42296 ms
> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries executed in 46094 ms
> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries executed in 47704 ms
> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries executed in 49221 ms
>
> ** Scanners **
> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all rows
> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges calculated: 3000 ranges found
> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2833 ms
> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2536 ms
> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2150 ms
> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2061 ms
> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6 range partitions using a pool of 6 threads
> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries executed in 2140 ms
>
> Query code is available at
> https://github.com/joshelser/accumulo-range-binning
>
>
> Sven Hodapp wrote:
>
>> Hi Keith,
>>
>> I've tried it with 1, 2 or 10 threads. Unfortunately there were no
>> amazing differences.
>> Maybe it's a problem with the table structure? For example it may happen
>> that one row id (e.g. a sentence) has several thousand column families. Can
>> this affect the seek performance?
>>
>> So for my initial example it has about 3000 row ids to seek, which will
>> return about 500k entries. If I filter for specific column families (e.g. a
>> document without annotations) it will return about 5k entries, but the seek
>> time will only be halved.
>> Are there too many column families to seek quickly?
>>
>> Thanks!
>>
>> Regards,
>> Sven
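
As a rough way to get absolute lookups per second out of the quoted logs, the
"Queries executed in N ms" timings can be combined with the range count. The
sketch below assumes that every iteration looks up all 3000 calculated ranges;
the logs suggest this but do not state it outright. The function and constant
names are made up for this example.

```python
# Timings copied from the "Queries executed in N ms" log lines quoted above.
BATCH_SCANNER_MS = [40178, 42296, 46094, 47704, 49221]
SCANNER_MS = [2833, 2536, 2150, 2061, 2140]

# Assumption: each iteration executes all 3000 calculated ranges.
RANGES_PER_ITERATION = 3000


def lookups_per_second(timings_ms, ranges=RANGES_PER_ITERATION):
    """Aggregate rate: total lookups divided by total elapsed seconds."""
    total_seconds = sum(timings_ms) / 1000.0
    return ranges * len(timings_ms) / total_seconds


batch_rate = lookups_per_second(BATCH_SCANNER_MS)
scanner_rate = lookups_per_second(SCANNER_MS)

print(f"BatchScanner: {batch_rate:.1f} lookups/s")
print(f"Scanner:      {scanner_rate:.1f} lookups/s")
print(f"Speedup:      {scanner_rate / batch_rate:.1f}x")
```

Under that assumption the five BatchScanner runs average roughly 67 lookups
per second and the five Scanner runs roughly 1,280, a ratio of about 19x,
which is in line with the ~20x figure quoted above.
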
