Mario Pastorelli wrote:
Hi, I'm currently having some scan speed issues with Accumulo and I would like to understand why and how I can solve them. I have geographical data and I use as primary key the day and then the geohex, which is a linearisation of lat and lon. The reason for this key is that I always query the data for one day but for a set of geohexes which represent a zone, so with this schema I can use a single scan to read all the data for one day with few seeks. My problem is that the scan is painfully slow: for instance, reading 5617019 rows takes around 17 seconds, and the scan speed is 13MB/s, less than 750k scan entries/s, with around 300 seeks. I enabled the tracer and this is what I got
13MB/s sounds like you're only actually querying one TabletServer. Dave and Andrew hit the nail on the head suggesting some sharding on the rowId. That will help get more servers involved in servicing your query.
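To make the sharding idea concrete, here is a rough sketch of how a shard prefix could be put in front of the existing day + geohex row. Everything in it is an assumption for illustration: the function names, the `<shard>_<day>_<geohex>` layout, the shard count of 15, and the sample geohex strings are not from Mario's schema. The point is only that rows for one day get spread over several shards, and a query then needs one range per shard.

```python
import zlib

# Assumption: roughly one shard per tablet server in a 15-node cluster.
NUM_SHARDS = 15


def shard_for(geohex, num_shards=NUM_SHARDS):
    """Derive a stable shard number from the geohex (CRC32 is deterministic across runs)."""
    return zlib.crc32(geohex.encode("utf-8")) % num_shards


def sharded_row_id(day, geohex, num_shards=NUM_SHARDS):
    """Build a row ID of the hypothetical form <shard>_<day>_<geohex>."""
    return "%02d_%s_%s" % (shard_for(geohex, num_shards), day, geohex)


def scan_ranges(day, geohex_start, geohex_end, num_shards=NUM_SHARDS):
    """Return one (start_row, end_row) pair per shard for the same day and geohex interval."""
    return [
        ("%02d_%s_%s" % (s, day, geohex_start), "%02d_%s_%s" % (s, day, geohex_end))
        for s in range(num_shards)
    ]


if __name__ == "__main__":
    print(sharded_row_id("20150301", "8a2b"))
    for start, end in scan_ranges("20150301", "8a00", "8aff"):
        print(start, "->", end)
```

With the Java client, ranges like these would typically be handed to a BatchScanner (via setRanges) rather than a plain Scanner, so that the shards can be read in parallel. The trade-off is that every query now touches every shard, and the tablets need to be split on the shard prefixes for the spread to pay off.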
You can also try turning on TRACE logging via log4j on org.apache.accumulo.core.client.impl. That should give you some insight about what the client is actually doing WRT RPCs.
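For reference, assuming the client application is configured through a log4j 1.x properties file (an assumption; adapt it if you use an XML config or a different logging setup), the logger line would look roughly like this:

```properties
# Assumption: log4j 1.x properties syntax, added to the client's log4j.properties
log4j.logger.org.apache.accumulo.core.client.impl=TRACE
```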
17325+0 Dice@srv1 Dice.query
11+1 Dice@srv1 scan
11+1 Dice@srv1 scan:location
5+13 Dice@srv1 scan
5+13 Dice@srv1 scan:location
4+19 Dice@srv1 scan
4+19 Dice@srv1 scan:location
5+23 Dice@srv1 scan
4+24 Dice@srv1 scan:location

I'm not sure how to speed up the scanning. I have the following questions:
- is this speed normal?
- can I involve more servers in the scan? Right now only two servers have the ranges, but with a cluster of 15 machines it would be nice to involve more of them. Is it possible?

Thanks,
Mario
