I wonder if doing a full compaction on the table in the shell might help
some as well...though I don't know that it will vastly increase performance.
The other option is lowering the split size for tablets to get more
parallelism, but that probably isn't scalable.
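For reference, both suggestions can be tried from the Accumulo shell; the table name below is a placeholder and the threshold value is just a starting point to tune, not a recommendation:

```shell
# Force a full major compaction of the table and wait for it to finish
compact -t mytable -w

# Lower the split threshold (default 1G) so the table splits into more
# tablets, which can spread the scan across more tservers
config -t mytable -s table.split.threshold=256M
```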
Back to the original query plan, I wonder if the 300 seeks could be
reduced somehow by forming tighter ranges...are you able to get any
timing on a scan of a range without the seeks?
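One way to get that timing is to drain a scanner over a single tight range and wrap the loop with System.nanoTime(). Only the small helper below is concrete; the Accumulo calls in the comments (table name, range bounds, connector) are assumptions about your setup:

```java
import java.util.Arrays;

public class ScanTimer {
    // Drains any Iterable (e.g. an Accumulo Scanner's Map.Entry<Key,Value>
    // stream) and returns the number of entries consumed.
    public static long drain(Iterable<?> entries) {
        long count = 0;
        for (Object ignored : entries) count++;
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical usage against Accumulo (all names are assumptions):
        // Scanner s = connector.createScanner("dice", auths);
        // s.setRange(new Range("20160410_geohexStart", "20160410_geohexEnd"));
        // long t0 = System.nanoTime();
        // long n = drain(s);
        // System.out.printf("%d entries in %.1f ms%n",
        //                   n, (System.nanoTime() - t0) / 1e6);

        // Self-contained demo on plain data:
        System.out.println(drain(Arrays.asList("a", "b", "c"))); // prints 3
    }
}
```

Comparing the per-range wall-clock time against the traced seek counts should show how much of the 17 seconds is seek overhead versus raw read throughput.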
On 04/10/2016 12:47 PM, Mario Pastorelli wrote:
I'm using a BatchScanner because I don't care about the order.
The sharding is indeed a good idea which I've already tested in the
past. The only problem that I've found with it is that there is no way
to be sure that the n ranges will be evenly distributed among the n
machines. Tablets are mapped to blocks and HDFS decides where to put
them so you could end up with two or more tablets of the same range
but different shards put on the same machine and disk.
Anyway, performance was better than not having sharding, so I will
re-enable it and do some tests with the number of shards.
On Sun, Apr 10, 2016 at 5:25 PM, Andrew Hulbert <[email protected]> wrote:
Mario,
Are you using a Scanner or a BatchScanner?
One thing we did in the past with a geohash-based schema was to
prefix a shard ID in front of the geohash that allows you to
involve all the tservers in the scan. You'd multiply your ranges
by the number of tservers you have but if the client is not the
bottleneck then it may increase your throughput.
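That expansion can be sketched as plain key construction. The row layout shard_day_geohex, the shard width, and the hex values below are made up for illustration; only the multiply-ranges-by-shards idea comes from the thread:

```java
import java.util.ArrayList;
import java.util.List;

public class ShardedRanges {
    // For a hypothetical row schema "shard_day_geohex", expand one
    // (day, geohex start..end) range into one range per shard so every
    // tserver can participate in the scan.
    public static List<String[]> shardRanges(int numShards, String day,
                                             String startHex, String endHex) {
        List<String[]> ranges = new ArrayList<>();
        for (int s = 0; s < numShards; s++) {
            String prefix = String.format("%02d_%s_", s, day);
            ranges.add(new String[] {prefix + startHex, prefix + endHex});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Prints one start..end pair per shard
        for (String[] r : shardRanges(4, "20160410", "8f2830", "8f2a10"))
            System.out.println(r[0] + " .. " + r[1]);
    }
}
```

The resulting pairs would be handed to a BatchScanner as its range set; with a shard count near the number of tservers, the scan load has a chance to spread across the whole cluster.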
Andrew
On 04/10/2016 11:05 AM, Mario Pastorelli wrote:
Hi,
I'm currently having some scan speed issues with Accumulo and I
would like to understand why and how I can solve them. I have
geographical data and I use as primary key the day and then the
geohex, which is a linearisation of lat and lon. The reason for
this key is that I always query the data for one day but for a
set of geohexes which represent a zone, so with this schema I can
use a single scan to read all the data for one day with few
seeks. My problem is that the scan is painfully slow: for
instance, reading 5617019 rows takes around 17 seconds, with a
scan speed of 13MB/s, less than 750k scan entries/s and around
300 seeks. I enabled the tracer and this is what I got:
17325+0 Dice@srv1 Dice.query
11+1 Dice@srv1 scan
11+1 Dice@srv1 scan:location
5+13 Dice@srv1 scan
5+13 Dice@srv1 scan:location
4+19 Dice@srv1 scan
4+19 Dice@srv1 scan:location
5+23 Dice@srv1 scan
4+24 Dice@srv1 scan:location
I'm not sure how to speed up the scanning. I have the following
questions:
- is this speed normal?
- can I involve more servers in the scan? Right now only two
servers have the ranges, but with a cluster of 15 machines it would
be nice to involve more of them. Is it possible?
Thanks,
Mario
--
Mario Pastorelli | TERALYTICS
*software engineer*
Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: [email protected]
www.teralytics.net
Company registration number: CH-020.3.037.709-7 | Trade register
Canton Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark
Schmitz, Yann de Vries
This e-mail message contains confidential information which is
for the sole attention and use of the intended recipient. Please
notify us at once if you think that it may not be intended for
you and delete it immediately.