We are trying to understand Accumulo performance so that we can better plan
our future products that use it, and we noticed that Accumulo's read speed
tends to be much lower than we would expect. We have a testing cluster with 4
HDFS+Accumulo nodes and we ran some tests on it. We wrote two programs that
write to HDFS and Accumulo, and two programs that read/scan the same number
of records, each containing random bytes, from HDFS and Accumulo. We run all
the programs from outside the cluster, on another node of the rack that runs
neither HDFS nor Accumulo.

We also wrote all the HDFS blocks and Accumulo tablets on the same machine
of the cluster.

First of all, we wrote 10M entries to HDFS, where each entry was 50 bytes.
This resulted in 4 blocks on HDFS. Reading these records with an
FSDataInputStream takes around 5.7 seconds, with an average speed of around
90MB per second.
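
For reference, the throughput figure follows directly from the record count,
the record size and the elapsed time. Below is a minimal sketch of that
arithmetic, assuming 1 MB = 1,000,000 bytes and counting only the 50-byte
payload of each entry. The class and method names are illustrative and are not
taken from the test programs.

```java
// Illustrative only: names are hypothetical, not from the original test programs.
final class Throughput {
    // Assumption: 1 MB = 1,000,000 bytes, and only payload bytes are counted.
    static double mbPerSecond(long records, int bytesPerRecord, double seconds) {
        double totalBytes = (double) records * bytesPerRecord;
        return totalBytes / 1_000_000.0 / seconds;
    }

    // Records read per second of wall-clock time.
    static double recordsPerSecond(long records, double seconds) {
        return records / seconds;
    }

    public static void main(String[] args) {
        // 10M entries of 50 bytes read in about 5.7 s (figures quoted above).
        System.out.printf("%.1f MB/s%n", mbPerSecond(10_000_000L, 50, 5.7));
        System.out.printf("%.0f records/s%n", recordsPerSecond(10_000_000L, 5.7));
    }
}
```

With these inputs the result is a little under 88 MB/s, which is in line with
the rounded figure of around 90MB per second.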

Then we wrote 10M entries to Accumulo, where each entry has a row of 50
random bytes, no column and no value. Writing is as fast as writing to HDFS,
apart from the compaction that we run at the end. The generated table has 1
tablet and obviously 10M records all on the same cluster. We waited for the
compaction to finish, then we opened a scanner without setting the range and
we read all the records. This time, reading the data took around 20 seconds,
with an average speed of 25MB/s and 500000 records/s, together with
~500 seeks/s. We have three questions about this result:

1 - Is this kind of performance expected?

2 - Is there any configuration that we can change to improve the scan speed?

3 - Why are there ~500 seeks/s if there is only one tablet and we read all
its bytes sequentially? What are those seeks doing?

We tried to use a BatchScanner with 1, 5 and 10 threads but the speed was
the same or even worse in some cases.
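
To keep such runs comparable, each reader can be timed with the same loop. The
sketch below is a self-contained stand-in, not the code we ran: it times a
plain `Iterable<byte[]>`, whereas with Accumulo the loop would iterate the
scanner's key/value entries and count the bytes of each row. `ScanTimer`,
`Result` and `time` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for timing a full scan over any record source.
final class ScanTimer {
    // Holds the raw counters so that rates can be derived afterwards.
    static final class Result {
        final long records;
        final long bytes;
        final double seconds;

        Result(long records, long bytes, double seconds) {
            this.records = records;
            this.bytes = bytes;
            this.seconds = seconds;
        }

        double recordsPerSecond() {
            return seconds > 0 ? records / seconds : 0.0;
        }

        // Assumption: 1 MB = 1,000,000 bytes.
        double mbPerSecond() {
            return seconds > 0 ? bytes / 1_000_000.0 / seconds : 0.0;
        }
    }

    // Consumes every record once and measures wall-clock time around the loop.
    static Result time(Iterable<byte[]> source) {
        long records = 0;
        long bytes = 0;
        long start = System.nanoTime();
        for (byte[] row : source) {
            records++;
            bytes += row.length;
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return new Result(records, bytes, seconds);
    }

    public static void main(String[] args) {
        // In-memory sample data: 1000 rows of 50 bytes each.
        List<byte[]> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(new byte[50]);
        }
        Result r = time(rows);
        System.out.println(r.records + " records, " + r.bytes + " bytes");
    }
}
```

Measuring only around the iteration loop keeps connection setup out of the
figure, so that differences between a scanner and a BatchScanner reflect the
read path itself.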

I can provide the code that we used as well as information about our
cluster configuration if you want.

Thanks,
Mario

-- 
Mario Pastorelli | TERALYTICS

