Re: Optimize Accumulo scan speed

Josh Elser Sun, 10 Apr 2016 20:38:50 -0700


Mario Pastorelli wrote:

Hi,

I'm currently having some scan speed issues with Accumulo and I would
like to understand why and how I can solve them. I have geographical data
and I use as primary key the day and then the geohex, which is a
linearisation of lat and lon. The reason for this key is that I always
query the data for one day but for a set of geohexes which represent a
zone, so with this schema I can use a single scan to read all the
data for one day with few seeks. My problem is that the scan is
painfully slow: for instance, to read 5617019 rows it takes around 17
seconds and the scan speed is 13MB/s, less than 750k scan entries/s and
around 300 seeks. I enabled the tracer and this is what I got:

13MB/s sounds like you're only actually querying one TabletServer. Dave and Andrew hit the nail on the head suggesting some sharding on the rowId. That will help get more servers involved in servicing your query.
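
As a rough illustration of what sharding the rowId could look like, here is a minimal sketch in plain Java. Everything in it is an assumption made for the example and not something from Mario's schema: the class name, the shard count of 15, the `shard_day_geohex` layout, and the use of `String.hashCode()` to pick the shard.

```java
import java.util.ArrayList;
import java.util.List;

public class ShardedRowId {
    // Hypothetical shard count; in practice this would be tuned to the cluster.
    static final int NUM_SHARDS = 15;

    // Pick a shard from the geohex so the same geohex always lands in the same shard.
    static int shardFor(String geohex) {
        return Math.floorMod(geohex.hashCode(), NUM_SHARDS);
    }

    // Assumed row layout: zero-padded shard, then day, then geohex.
    static String rowId(String day, String geohex) {
        return String.format("%02d_%s_%s", shardFor(geohex), day, geohex);
    }

    // Row prefixes that together cover one day across every shard.
    static List<String> rowPrefixesForDay(String day) {
        List<String> prefixes = new ArrayList<>();
        for (int shard = 0; shard < NUM_SHARDS; shard++) {
            prefixes.add(String.format("%02d_%s_", shard, day));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Print the row id for one example cell and the per-shard prefixes for its day.
        System.out.println(rowId("20160410", "8a2b"));
        for (String prefix : rowPrefixesForDay("20160410")) {
            System.out.println(prefix);
        }
    }
}
```

With a layout like this, a query for one day turns into one range per shard instead of one contiguous range, and those ranges could be handed to a BatchScanner so that several tablet servers work on them in parallel. The trade-off is that rows for a day are no longer stored contiguously, so a single sequential scan over the whole day is no longer possible.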

You can also try turning on TRACE logging via log4j on org.apache.accumulo.core.client.impl. That should give you some insight about what the client is actually doing WRT RPCs.
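
For reference, assuming the client application is configured through a log4j 1.x properties file (an assumption, since your setup may configure logging differently), the logger line would look roughly like this:

```
# Assumes a log4j 1.x properties file on the client's classpath.
log4j.logger.org.apache.accumulo.core.client.impl=TRACE
```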

17325+0 Dice@srv1 Dice.query
11+1 Dice@srv1 scan
11+1 Dice@srv1 scan:location
5+13 Dice@srv1 scan
5+13 Dice@srv1 scan:location
4+19 Dice@srv1 scan
4+19 Dice@srv1 scan:location
5+23 Dice@srv1 scan
4+24 Dice@srv1 scan:location

I'm not sure how to speed up the scanning. I have the following questions:
   - is this speed normal?
   - can I involve more servers in the scan? Right now only two servers
have the ranges but with a cluster of 15 machines it would be nice to
involve more of them. Is it possible?

Thanks,
Mario
