Some other thoughts in addition to the sharding:

 

1. Are your tablets spread out evenly across your tablet servers?

2. How many threads are you using in your batch scanner?

3. What is the table.scan.max.memory setting? 

 

From: Andrew Hulbert [mailto:[email protected]] 
Sent: Sunday, April 10, 2016 1:01 PM
To: [email protected]
Subject: Re: Optimize Accumulo scan speed

 

I wonder if doing a full compaction on the table in the shell might help some 
as well...though I don't know that it will vastly increase performance. The 
other option is lowering the split size for tablets for more parallelism, but 
that probably isn't scalable.

Back to the original query plan, I wonder if the 300 seeks could be reduced 
somehow by forming tighter ranges...are you able to get any timing on a scan 
of a range without the seeks?
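
One possible reading of "tighter ranges" is to coalesce neighbouring cells into 
fewer, longer ranges before handing them to the scanner. The sketch below is 
only an illustration: it assumes each geohex can be mapped to a long index 
along the linearisation, and the class name RangeCoalescer and the maxGap 
parameter are hypothetical, not part of Accumulo or of the schema discussed 
here. A larger maxGap trades some extra data read for fewer seeks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns cell indices into inclusive [start, end] runs. */
final class RangeCoalescer {

    /**
     * Sorts the cell indices and merges neighbours into runs.
     * Two cells end up in the same run when at most maxGap cells are missing
     * between them (maxGap = 0 merges only adjacent or duplicate cells).
     */
    static List<long[]> coalesce(long[] cells, long maxGap) {
        long[] sorted = cells.clone();
        Arrays.sort(sorted);
        List<long[]> runs = new ArrayList<>();
        for (long cell : sorted) {
            if (!runs.isEmpty()) {
                long[] last = runs.get(runs.size() - 1);
                // Extend the current run if the gap to this cell is small enough.
                if (cell - last[1] <= maxGap + 1) {
                    last[1] = Math.max(last[1], cell);
                    continue;
                }
            }
            // Otherwise start a new single-cell run.
            runs.add(new long[] {cell, cell});
        }
        return runs;
    }

    public static void main(String[] args) {
        long[] cells = {7, 1, 2, 3, 8, 20};
        for (long[] run : coalesce(cells, 0)) {
            System.out.println(run[0] + ".." + run[1]);
        }
    }
}
```

Each resulting run would then become one scan range, so the number of runs is 
an upper bound on the seeks needed for the query.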

On 04/10/2016 12:47 PM, Mario Pastorelli wrote:

I'm using a BatchScanner because I don't care about the order.

The sharding is indeed a good idea which I've already tested in the past. The 
only problem that I've found with it is that there is no way to be sure that 
the n ranges will be evenly distributed among the n machines. Tablets are 
mapped to blocks and HDFS decides where to put them so you could end up with 
two or more tablets of the same range but different shards put on the same 
machine and disk.

Anyway, performance was better than without sharding, so I will re-enable it 
and do some tests with the number of shards.

 

On Sun, Apr 10, 2016 at 5:25 PM, Andrew Hulbert <[email protected]> wrote:

Mario,

Are you using a Scanner or a BatchScanner?

One thing we did in the past with a geohash-based schema was to prefix a shard 
ID in front of the geohash that allows you to involve all the tservers in the 
scan. You'd multiply your ranges by the number of tservers you have but if the 
client is not the bottleneck then it may increase your throughput.
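
To make the shard-prefix idea concrete, here is a minimal sketch in plain Java. 
The row layout "shard_day_geohash", the two-digit prefix, the hash-based shard 
choice and the class name ShardedRowKeys are all assumptions made for 
illustration; they are not the actual schema from either project. It shows how 
a row key could be built at write time and how one logical range turns into one 
range per shard at query time.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for a "shard_day_geohash" row layout (an assumption). */
final class ShardedRowKeys {

    // Zero-padded prefix so shards sort lexicographically (assumes numShards <= 100).
    static String shardPrefix(int shard) {
        return String.format("%02d", shard);
    }

    // Pick a shard from the geohash; floorMod keeps the result in [0, numShards).
    static int shardFor(String geohash, int numShards) {
        return Math.floorMod(geohash.hashCode(), numShards);
    }

    // Row key used when writing an entry.
    static String rowKey(String day, String geohash, int numShards) {
        return shardPrefix(shardFor(geohash, numShards)) + "_" + day + "_" + geohash;
    }

    // Expand one logical [start, end] geohash range into one range per shard.
    static List<String[]> expand(String day, String start, String end, int numShards) {
        List<String[]> ranges = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            String prefix = shardPrefix(shard) + "_" + day + "_";
            ranges.add(new String[] {prefix + start, prefix + end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (String[] r : expand("20160410", "u0q", "u0r", 4)) {
            System.out.println(r[0] + " -> " + r[1]);
        }
    }
}
```

With 300 logical ranges and 15 shards this yields 4500 ranges for the 
BatchScanner, which is the multiplication mentioned above.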

Andrew 

 

On 04/10/2016 11:05 AM, Mario Pastorelli wrote:

Hi,

I'm currently having some scan speed issues with Accumulo and I would like to 
understand why and how I can solve them. I have geographical data and I use as 
primary key the day and then the geohex, which is a linearisation of lat and 
lon. The reason for this key is that I always query the data for one day but 
for a set of geohexes which represent a zone, so with this schema I can use 
a single scan to read all the data for one day with few seeks. My problem is 
that the scan is painfully slow: for instance, to read 5617019 rows it takes 
around 17 seconds and the scan speed is 13MB/s, less than 750k scan entries/s 
and around 300 seeks. I enabled the tracer and this is what I got:

17325+0 Dice@srv1 Dice.query
  11+1 Dice@srv1 scan
  11+1 Dice@srv1 scan:location
  5+13 Dice@srv1 scan
  5+13 Dice@srv1 scan:location
  4+19 Dice@srv1 scan
  4+19 Dice@srv1 scan:location
  5+23 Dice@srv1 scan
  4+24 Dice@srv1 scan:location
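
As a rough sanity check on the figures quoted above, the sketch below works 
out the implied rows per second and bytes per row. It assumes that the 17325 
in the Dice.query trace line is the elapsed time in milliseconds, that the 
13MB/s rate applies to that whole interval, and that 1 MB is 10^6 bytes; the 
class name ScanThroughput is hypothetical.

```java
/** Hypothetical helper: back-of-the-envelope arithmetic on the quoted figures. */
final class ScanThroughput {

    // Rows returned per second of wall-clock time.
    static double rowsPerSecond(long rows, long elapsedMillis) {
        return rows * 1000.0 / elapsedMillis;
    }

    // Average bytes per row, assuming the MB/s rate held for the whole scan
    // and 1 MB = 1,000,000 bytes.
    static double bytesPerRow(double mbPerSecond, long elapsedMillis, long rows) {
        double totalBytes = mbPerSecond * 1_000_000.0 * elapsedMillis / 1000.0;
        return totalBytes / rows;
    }

    public static void main(String[] args) {
        long rows = 5_617_019L;        // rows read, from the message
        long elapsedMillis = 17_325L;  // assumed: Dice.query span duration in ms
        System.out.printf("rows/s:    %.0f%n", rowsPerSecond(rows, elapsedMillis));
        System.out.printf("bytes/row: %.1f%n", bytesPerRow(13.0, elapsedMillis, rows));
    }
}
```

Under those assumptions the scan delivers roughly 324k rows per second at 
about 40 bytes per row, so the rows are small and the cost is likely dominated 
by per-entry overhead rather than raw bandwidth.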

I'm not sure how to speed up the scanning. I have the following questions:

  - is this speed normal?

  - can I involve more servers in the scan? Right now only two servers have the 
ranges, but with a cluster of 15 machines it would be nice to involve more of 
them. Is it possible?

Thanks,

Mario

 

-- 

Mario Pastorelli | TERALYTICS

software engineer

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
phone: +41794381682
email: [email protected]
www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de 
Vries

This e-mail message contains confidential information which is for the sole 
attention and use of the intended recipient. Please notify us at once if you 
think that it may not be intended for you and delete it immediately.

 



