Re: Accumulo Seek performance

Adam Fuchs Mon, 12 Sep 2016 08:42:06 -0700

Josh,

Two questions:


1. How many iterations did you do? I would like to see an absolute number
of lookups per second to compare against other observations.

2. Can you post your code somewhere so I can run it?

Thanks,
Adam


On Sat, Sep 10, 2016 at 3:01 PM, Josh Elser <[email protected]> wrote:

> Sven, et al:
>
> So, it would appear that I have been able to reproduce this one (better
> late than never, I guess...). tl;dr Serially using Scanners to do point
> lookups instead of a BatchScanner is ~20x faster. This sounds like a pretty
> serious performance issue to me.
>
> Here's a general outline for what I did.
>
> * Accumulo 1.8.0
> * Created a table with 1M rows, each row with 10 columns using YCSB
> (workloada)
> * Split the table into 9 tablets
> * Computed the set of all rows in the table
>
> For a number of iterations:
> * Shuffle this set of rows
> * Choose the first N rows
> * Construct an equivalent set of Ranges from the set of Rows, choosing a
> random column (0-9)
> * Partition the N rows into X collections
> * Submit X tasks to query one partition of the N rows (to a thread pool
> with X fixed threads)
>
> I have two implementations of these tasks. One, where all ranges in a
> partition are executed via one BatchWriter. A second where each range is
> executed in serial using a Scanner. The numbers speak for themselves.
>
> ** BatchScanners **
> 2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all
> rows
> 2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges
> calculated: 3000 ranges found
> 2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 40178 ms
> 2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 42296 ms
> 2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 46094 ms
> 2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 47704 ms
> 2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 49221 ms
>
> ** Scanners **
> 2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all
> rows
> 2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges
> calculated: 3000 ranges found
> 2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2833 ms
> 2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2536 ms
> 2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2150 ms
> 2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2061 ms
> 2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6
> range partitions using a pool of 6 threads
> 2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries
> executed in 2140 ms
>
> Query code is available https://github.com/joshelser/a
> ccumulo-range-binning
>
>
> Sven Hodapp wrote:
>
>> Hi Keith,
>>
>> I've tried it with 1, 2 or 10 threads. Unfortunately there where no
>> amazing differences.
>> Maybe it's a problem with the table structure? For example it may happen
>> that one row id (e.g. a sentence) has several thousand column families. Can
>> this affect the seek performance?
>>
>> So for my initial example it has about 3000 row ids to seek, which will
>> return about 500k entries. If I filter for specific column families (e.g. a
>> document without annotations) it will return about 5k entries, but the seek
>> time will only be halved.
>> Are there to much column families to seek it fast?
>>
>> Thanks!
>>
>> Regards,
>> Sven
>>
>>

Re: Accumulo Seek performance

Reply via email to