RE: Accumulo Seek performance

Dan Blum Mon, 12 Sep 2016 07:56:31 -0700

Is this a problem specific to 1.8.0, or is it likely to affect earlier versions?


-----Original Message-----
From: Josh Elser [mailto:[email protected]] 
Sent: Saturday, September 10, 2016 6:01 PM
To: [email protected]
Subject: Re: Accumulo Seek performance

Sven, et al:

So, it would appear that I have been able to reproduce this one (better 
late than never, I guess...). tl;dr Serially using Scanners to do point 
lookups instead of a BatchScanner is ~20x faster. This sounds like a 
pretty serious performance issue to me.

Here's a general outline for what I did.

* Accumulo 1.8.0
* Created a table with 1M rows, each row with 10 columns using YCSB 
(workloada)
* Split the table into 9 tablets
* Computed the set of all rows in the table

For a number of iterations:
* Shuffle this set of rows
* Choose the first N rows
* Construct an equivalent set of Ranges from the set of Rows, choosing a 
random column (0-9)
* Partition the N rows into X collections
* Submit X tasks to query one partition of the N rows (to a thread pool 
with X fixed threads)
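
A minimal sketch of the per-iteration steps above, using only the JDK. This is not the code from the linked repository. The class name, the "row|column" target encoding, and the `lookup` callback are hypothetical stand-ins; in a real test the callback would wrap the Accumulo client calls and return the number of entries read.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

class RangePartitionSketch {

    // Split items into x partitions round-robin; partition sizes differ by at most one.
    static <T> List<List<T>> partition(List<T> items, int x) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < x; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            parts.get(i % x).add(items.get(i));
        }
        return parts;
    }

    // One iteration: shuffle, take the first n rows, pick a random column (0-9)
    // per row, partition into x collections, and run one task per partition on
    // a fixed pool of x threads. Returns the sum of the lookup results.
    static int runIteration(List<String> allRows, int n, int x, Random rnd,
                            ToIntFunction<List<String>> lookup) {
        List<String> shuffled = new ArrayList<>(allRows);
        Collections.shuffle(shuffled, rnd);

        // Hypothetical encoding of a point-lookup target as "row|column".
        List<String> targets = new ArrayList<>();
        for (String row : shuffled.subList(0, n)) {
            targets.add(row + "|" + rnd.nextInt(10));
        }

        ExecutorService pool = Executors.newFixedThreadPool(x);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> part : partition(targets, x)) {
                Callable<Integer> task = () -> lookup.applyAsInt(part);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get(); // wait for every partition to finish
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("query task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add("user" + i);
        }
        long start = System.currentTimeMillis();
        // Stand-in lookup that just counts targets instead of querying a table.
        int seen = runIteration(rows, 300, 6, new Random(), List::size);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(seen + " lookups executed in " + elapsed + " ms");
    }
}
```

Passing the query strategy in as a callback keeps the shuffle, partition, and thread-pool logic identical between runs, so only the lookup itself differs.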

I have two implementations of these tasks: one where all ranges in a 
partition are executed via one BatchScanner, and a second where each range 
is executed serially using a Scanner. The numbers speak for themselves.

** BatchScanners **
2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all 
rows
2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges 
calculated: 3000 ranges found
2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 40178 ms
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 42296 ms
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 46094 ms
2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 47704 ms
2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 49221 ms

** Scanners **
2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all 
rows
2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges 
calculated: 3000 ranges found
2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2833 ms
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2536 ms
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2150 ms
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2061 ms
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6 
range partitions using a pool of 6 threads
2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries 
executed in 2140 ms
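
As a quick check on the "~20x" figure, the mean of the five logged timings in each run can be compared directly. This is only arithmetic on the numbers printed above; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

class SpeedupSketch {

    // Arithmetic mean of a set of timings in milliseconds.
    static double mean(long[] ms) {
        return Arrays.stream(ms).average().orElse(Double.NaN);
    }

    // Ratio of the slower mean to the faster mean.
    static double speedup(long[] slow, long[] fast) {
        return mean(slow) / mean(fast);
    }

    public static void main(String[] args) {
        // "Queries executed in N ms" values copied from the two log excerpts.
        long[] batchScanner = {40178, 42296, 46094, 47704, 49221};
        long[] scanner = {2833, 2536, 2150, 2061, 2140};

        System.out.printf(Locale.ROOT, "BatchScanner mean: %.1f ms%n", mean(batchScanner));
        System.out.printf(Locale.ROOT, "Scanner mean: %.1f ms%n", mean(scanner));
        System.out.printf(Locale.ROOT, "Speedup: %.1fx%n", speedup(batchScanner, scanner));
        // Prints:
        // BatchScanner mean: 45098.6 ms
        // Scanner mean: 2344.0 ms
        // Speedup: 19.2x
    }
}
```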

Query code is available at https://github.com/joshelser/accumulo-range-binning

Sven Hodapp wrote:
> Hi Keith,
>
> I've tried it with 1, 2 or 10 threads. Unfortunately there were no amazing 
> differences.
> Maybe it's a problem with the table structure? For example it may happen that 
> one row id (e.g. a sentence) has several thousand column families. Can this 
> affect the seek performance?
>
> So for my initial example it has about 3000 row ids to seek, which will 
> return about 500k entries. If I filter for specific column families (e.g. a 
> document without annotations) it will return about 5k entries, but the seek 
> time will only be halved.
> Are there too many column families to seek quickly?
>
> Thanks!
>
> Regards,
> Sven
>
