Re: Accumulo Seek performance

Josh Elser Mon, 12 Sep 2016 09:16:04 -0700

5 iterations, figured that would be apparent from the log messages :)


The code is already posted in my original message.

Adam Fuchs wrote:

Josh,

Two questions:

1. How many iterations did you do? I would like to see an absolute
number of lookups per second to compare against other observations.

2. Can you post your code somewhere so I can run it?

Thanks,
Adam


On Sat, Sep 10, 2016 at 3:01 PM, Josh Elser wrote:

    Sven, et al:

    So, it would appear that I have been able to reproduce this one
    (better late than never, I guess...). tl;dr Serially using Scanners
    to do point lookups instead of a BatchScanner is ~20x faster. This
    sounds like a pretty serious performance issue to me.

    Here's a general outline for what I did.

    * Accumulo 1.8.0
    * Created a table with 1M rows, each row with 10 columns using YCSB
    (workloada)
    * Split the table into 9 tablets
    * Computed the set of all rows in the table

    For a number of iterations:
    * Shuffle this set of rows
    * Choose the first N rows
    * Construct an equivalent set of Ranges from the set of Rows,
    choosing a random column (0-9)
    * Partition the N rows into X collections
    * Submit X tasks to query one partition of the N rows (to a thread
    pool with X fixed threads)

    I have two implementations of these tasks. One, where all ranges in
    a partition are executed via one BatchScanner. A second where each
    range is executed in serial using a Scanner. The numbers speak for
    themselves.
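
The benchmark loop outlined above can be sketched roughly as follows. This is a minimal illustration rather than the code from the repository linked in this message: the class name `LookupHarness`, the method names, and the `queryTask` callback are all hypothetical, rows are plain strings, and the step that turns each row plus a random column (0-9) into an Accumulo Range is left to the callback. The callback stands in for either task implementation (one BatchScanner per partition, or one Scanner per range).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical harness: names and structure are assumptions, not the original code.
class LookupHarness {

    /** Split items into `parts` contiguous sublists of near-equal size. */
    public static <T> List<List<T>> partition(List<T> items, int parts) {
        List<List<T>> out = new ArrayList<>();
        int size = items.size();
        for (int i = 0; i < parts; i++) {
            int from = (int) ((long) size * i / parts);
            int to = (int) ((long) size * (i + 1) / parts);
            out.add(new ArrayList<>(items.subList(from, to)));
        }
        return out;
    }

    /**
     * One iteration: shuffle the rows, take the first n, partition them into x
     * collections, and run one task per partition on a pool of x threads.
     * Returns the wall-clock time in milliseconds for all tasks to finish.
     */
    public static long runIteration(List<String> allRows, int n, int x, Random rnd,
                                    Consumer<List<String>> queryTask) throws Exception {
        List<String> shuffled = new ArrayList<>(allRows);
        Collections.shuffle(shuffled, rnd);
        List<String> chosen = shuffled.subList(0, Math.min(n, shuffled.size()));
        List<List<String>> partitions = partition(chosen, x);

        ExecutorService pool = Executors.newFixedThreadPool(x);
        try {
            long start = System.nanoTime();
            List<Future<?>> futures = new ArrayList<>();
            for (List<String> p : partitions) {
                // queryTask would build the Ranges and run the scan(s) for p.
                futures.add(pool.submit(() -> queryTask.accept(p)));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface any task failure
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }
}
```

With the figures from the logs below, a call would look like `runIteration(rows, 3000, 6, new Random(), task)`, repeated once per iteration.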

    ** BatchScanners **
    2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled
    all rows
    2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All
    ranges calculated: 3000 ranges found
    2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 40178 ms
    2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 42296 ms
    2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 46094 ms
    2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 47704 ms
    2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 49221 ms

    ** Scanners **
    2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled
    all rows
    2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All
    ranges calculated: 3000 ranges found
    2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 2833 ms
    2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 2536 ms
    2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 2150 ms
    2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 2061 ms
    2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO :
    Executing 6 range partitions using a pool of 6 threads
    2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries
    executed in 2140 ms
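
For an absolute lookup rate, the timings logged above can be converted as in the sketch below. It assumes that every iteration executes all 3000 ranges reported in the log; the class and method names are hypothetical, and the elapsed times are copied from the log lines above.

```java
// Hypothetical helper: converts logged iteration times into lookups per second.
class Throughput {

    /** Total lookups across all iterations divided by total elapsed seconds. */
    public static double lookupsPerSecond(long lookupsPerIteration, long[] elapsedMillis) {
        long totalMillis = 0;
        for (long ms : elapsedMillis) {
            totalMillis += ms;
        }
        long totalLookups = lookupsPerIteration * elapsedMillis.length;
        return totalLookups * 1000.0 / totalMillis;
    }

    public static void main(String[] args) {
        // "Queries executed in N ms" values from the two runs above.
        long[] batchScanner = {40178, 42296, 46094, 47704, 49221};
        long[] scanner = {2833, 2536, 2150, 2061, 2140};

        // Assumption: each iteration issues all 3000 ranges.
        double batchRate = lookupsPerSecond(3000, batchScanner);
        double scannerRate = lookupsPerSecond(3000, scanner);

        System.out.printf("BatchScanner: %.1f lookups/s%n", batchRate);
        System.out.printf("Scanner:      %.1f lookups/s%n", scannerRate);
        System.out.printf("Ratio:        %.1fx%n", scannerRate / batchRate);
    }
}
```

Under that assumption this works out to roughly 67 lookups per second for the BatchScanner run and roughly 1,280 for the Scanner run, a ratio of about 19x.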

    Query code is available at
    https://github.com/joshelser/accumulo-range-binning


    Sven Hodapp wrote:

        Hi Keith,

        I've tried it with 1, 2 or 10 threads. Unfortunately there were
        no significant differences.
        Maybe it's a problem with the table structure? For example it
        may happen that one row id (e.g. a sentence) has several
        thousand column families. Can this affect the seek performance?

        So for my initial example it has about 3000 row ids to seek,
        which will return about 500k entries. If I filter for specific
        column families (e.g. a document without annotations) it will
        return about 5k entries, but the seek time will only be halved.
        Are there too many column families to seek quickly?

        Thanks!

        Regards,
        Sven
